Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

Intro & Agenda
• Crawler Intro & Problem Statements
• Crawler Architecture
• Infrastructure: Akka Streams, Kafka, etc.
• The Goodies

High-Level View
[Diagram labels: Crawl Jobs, Job DB, Validate, URL Cache, Download, Process; flows labeled URLs and Timestamps]

Requirements
• Ever-expanding # of URLs
• Can’t crawl all URLs at once
• Control over concurrent web GETs
• Efficient resource usage
• Resilient under high burst
• Scales horizontally & vertically

Sizing the Crawl Job
Let:
i = Number of initial crawl URLs in a job
n = Average number of links per page
d = The crawl depth (how many layers to follow links)
u = The max number of URLs to process
Then:
$u = i \cdot n^{d}$
[Chart: totalURLs vs depth (initialURLs = 1, outLinks = 5), log-scale totalURLs axis]
[Chart: totalURLs vs initialURLs (depth = 5, outLinks = 5), log-scale on both axes]
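The sizing formula can be checked with a few lines of Scala. This is a minimal sketch using only the standard library; the object and method names (CrawlSizing, maxUrls) are made up for illustration and do not come from the crawler itself.

```scala
object CrawlSizing {
  // u = i * n^d. BigInt avoids overflow for deep crawls.
  def maxUrls(initialUrls: Long, linksPerPage: Int, depth: Int): BigInt =
    BigInt(initialUrls) * BigInt(linksPerPage).pow(depth)

  def main(args: Array[String]): Unit = {
    // One seed URL, 5 out-links per page, increasing depth.
    for (d <- 0 to 10 by 2) println(s"depth=$d -> ${maxUrls(1, 5, d)}")
    // Fixed depth of 5, increasing number of seed URLs.
    for (i <- Seq(1L, 100L, 10000L)) println(s"initialURLs=$i -> ${maxUrls(i, 5, 5)}")
  }
}
```

With 100 initial URLs, 5 links per page, and depth 5, maxUrls gives 312500.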

The Reactive Manifesto
• Responsive
• Resilient
• Elastic
• Message Driven

Why Does it Matter?
• Responsive: responds in a deterministic, timely manner
• Resilient: stays responsive in the face of failure, even cascading failures
• Elastic: stays responsive under workload spikes
• Message Driven: basic building block for responsive, resilient, and elastic systems

The Right Ingredients
• Kafka
  • Huge persistent buffer for the bursts
  • Load distribution to a very large number of processing nodes
  • Enables horizontal scalability
• Akka Streams
  • High-performance, highly efficient processing pipeline
  • Resilient with end-to-end back-pressure
  • Fully asynchronous: uses mapAsyncUnordered with the Async HTTP client
• Async HTTP client
  • Non-blocking; consumes no threads while waiting
  • Integrates with Akka Streams for a high-parallelism, low-resource solution
[Diagram labels: Efficient, Resilient, Scale; Akka Stream, Async HTTP, Reactive Kafka]
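A minimal sketch of the mapAsyncUnordered idea, assuming Akka 2.6+ (where an implicit ActorSystem supplies the materializer). The fakeGet function is a hypothetical stand-in for a non-blocking HTTP GET; it completes a Future from the scheduler, so no thread waits. The URLs and latencies are invented for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object AsyncDownloadSketch {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("download-sketch")
    import system.dispatcher

    // Hypothetical async GET: the Future completes after `latency` without blocking a thread.
    def fakeGet(url: String, latency: FiniteDuration): Future[String] =
      akka.pattern.after(latency, system.scheduler)(Future.successful(s"body of $url"))

    val urls = List("slow" -> 300.millis, "fast" -> 10.millis)

    val bodies: Future[Seq[String]] =
      Source(urls)
        // At most 2 downloads in flight; results are emitted as they complete,
        // so the slow URL does not hold back the fast one.
        .mapAsyncUnordered(parallelism = 2) { case (url, latency) => fakeGet(url, latency) }
        .runWith(Sink.seq[String])

    try Await.result(bodies, 5.seconds)
    finally system.terminate()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The parallelism argument is the knob that bounds concurrent GETs; swapping in mapAsync would preserve input order at the cost of letting slow URLs delay fast ones.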

Adding Kafka & Akka Streams
[Diagram labels: Crawl Jobs, Job DB, Validate, URL Cache, Download, Process, Akka Streams; flows labeled URLs and Timestamps]

Akka Streams, what???
• High performance, pure async, stream processing
• Conforms to Reactive Streams
• Simple, yet powerful GraphDSL allows clear stream topology declaration
• Central point to understand processing pipeline

Crawl Stream
Actual Stream Declaration in Code
prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
                                                       bCast ~> dataSinkFlow ~> kafkaDataSink
                                                       bCast ~> hdfsDataSink
                                                       bCast ~> graphFlow ~> merge ~> graphSink
                                   bCast0 ~> maxPage ~> merge
                                   bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
                                                      bCastRetry ~> errorSink
[Stream graph diagram. Stages: Prioritized Source, Crawl, Result, MaxPageReached, Retry, OutLinks, Data, Graph, CheckFail, CheckErr. Sinks: OutLinks Sink, Kafka Data Sink, HDFS Data Sink, Graph Sink, Error Sink]
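For readers new to the GraphDSL, here is a much smaller graph with the same broadcast-and-merge shape. It is an illustrative sketch assuming Akka 2.6+, not the crawler's code; okFlow, retryFlow, and the divisible-by-3 rule are hypothetical placeholders for real result and retry handling.

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object FanOutSketch {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("fan-out-sketch")
    val collect = Sink.seq[String]

    val graph = RunnableGraph.fromGraph(GraphDSL.create(collect) { implicit builder => sink =>
      import GraphDSL.Implicits._
      val bCast = builder.add(Broadcast[Int](2)) // every element goes to both branches
      val merge = builder.add(Merge[String](2))  // branches rejoin before the sink

      // Each branch keeps only the elements it is responsible for.
      val okFlow    = Flow[Int].filter(_ % 3 != 0).map(n => s"ok:$n")
      val retryFlow = Flow[Int].filter(_ % 3 == 0).map(n => s"retry:$n")

      Source(1 to 6) ~> bCast ~> okFlow    ~> merge ~> sink
                        bCast ~> retryFlow ~> merge
      ClosedShape
    })

    try Await.result(graph.run(), 5.seconds)
    finally system.terminate()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The two wiring lines read like the topology they describe, which is the point of declaring the whole stream in one place. Merge does not guarantee ordering between branches, so callers should not rely on the order of the collected results.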

Resulting Characteristics
Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High-latency URLs do not block low-latency URLs, thanks to mapAsyncUnordered
• Well-controlled download concurrency via the mapAsyncUnordered parallelism setting
• Thread per concurrent crawl job
Resilient
• Processes only what can be processed, so no resource overload
• Kafka as short-term, persistent queue
Scale
• Kafka feeds the next batch of URLs to the available node cluster
• Pull model: only processes that have capacity get the load
• Kafka distributes work to a large number of processing nodes in the cluster

Back-Pressure
[Chart: Queue Size vs Time (seconds); queue size axis 0 to 120,000, time axis 0 to 700 s]
[Chart: URLs/sec vs Time (seconds); URLs/sec axis 0 to 400]
Parameters:
initialURLs : 100
parallelism : 1000
processTime : 1 – 5 s
outLinks : 0 - 10
depth : 5
totalCrawled : 312500
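Back-pressure can be observed in a few lines. This is a sketch assuming Akka 2.6+, not the benchmark behind the charts: an endless source feeds a deliberately slow stage, and a counter (the hypothetical name pulled) records how many elements the source was actually asked for. The 20 ms delay and the parallelism of 4 are arbitrary illustration values.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object BackPressureSketch {
  // Returns (elements pulled from the source, elements processed downstream).
  def run(): (Int, Int) = {
    implicit val system: ActorSystem = ActorSystem("back-pressure-sketch")
    import system.dispatcher

    val pulled = new AtomicInteger(0)
    // Endless source: it only produces when downstream signals demand.
    val endless = Source.fromIterator(() => Iterator.continually(pulled.incrementAndGet()))

    val processed = endless
      // Slow stage: at most 4 elements in flight, each taking about 20 ms.
      .mapAsyncUnordered(parallelism = 4) { n =>
        akka.pattern.after(20.millis, system.scheduler)(Future.successful(n))
      }
      .take(20)
      .runWith(Sink.seq[Int])

    try {
      val done = Await.result(processed, 5.seconds)
      (pulled.get, done.size)
    } finally system.terminate()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The source is capable of producing without bound, yet the pulled count stays close to the 20 processed elements plus the few in flight, because demand propagates upstream from the slow stage.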

Challenges
Training
• Developers not used to E2E stream definitions
• More familiar with deeply nested function calls
Maturity of Infrastructure
• Kafka 0.9 uses fetch as heartbeat
• Slow nodes cause timeout & rebalance
• Solved in 0.10

What it would have been…
• Bloated, ineffective concurrency control
• Lack of well-thought-out and visible processing pipeline
• Clumsy code, hard to manage & understand
• Low training cost, high project TCO (Dev / Support / Maintenance)

Standardized Reactive Platform

Efficiency & Resilience meets Standardization
• Monitoring
  • Need to collect metrics, consistently
• Logging
  • Correlation across services
  • Uniformity in logs
• Security
  • Need to apply standard security configuration
• Environment Resolution
  • Staging, production, etc.
Consistency in the face of Heterogeneity

squbs is not…
• A framework on its own
• A programming model: use Akka
• Take all or none: components/patterns can mostly be used independently

squbs: Akka for large scale deployments
• Bootstrap
• Lifecycle management
• Loosely-coupled module system
• Integration hooks for logging, monitoring, ops integration

squbs: Akka for large scale deployments
• JSON console
• HttpClient with pluggable resolver and monitoring/logging hooks
• Test tools and interfaces
• Goodies:
  - Activators for Scala & Java
  - Programming patterns and helpers for Akka and Akka Stream use cases…, and growing

PerpetualStream
• Provides a convenience trait to help write streams controlled by system lifecycle
• Minimal/no message losses
• Register PerpetualStream to make stream start/stop
• Provides customization hooks, especially for how to stop the stream
• Provides killSwitch (from Akka) to be embedded into stream
• Implementers: just provide your stream!
A non-stop stream; starts and stops with the system
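import akka.Done
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}
import org.squbs.stream.PerpetualStream
import scala.concurrent.Future

// Sink.ignore materializes a Future[Done], so that is the stream's materialized type.
class MyStream extends PerpetualStream[Future[Done]] {
  // Endless counter that wraps back to 0 at Int.MaxValue.
  def generator = Iterator.iterate(0) { p =>
    if (p == Int.MaxValue) 0 else p + 1
  }
  val source = Source.fromIterator(() => generator)
  val ignoreSink = Sink.ignore
  override def streamGraph = RunnableGraph.fromGraph(
    GraphDSL.create(ignoreSink) { implicit builder =>
      sink =>
        import GraphDSL.Implicits._
        // killSwitch comes from PerpetualStream and lets the lifecycle stop the stream.
        source ~> killSwitch.flow[Int] ~> sink
        ClosedShape
    })
}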

PersistentBuffer/BroadcastBuffer
• Data & indexes in rotating memory-mapped files
• Off-heap rotating file buffer – very large buffers
• Restarts gracefully with no or minimal message loss
• Not as durable as a remote data store, but much faster
• Does not back-pressure upstream beyond data/index writes
• Similar usage to Buffer and Broadcast
• BroadcastBuffer: a FanOutShape that decouples each output port, making each downstream independent
• Useful if downstream stage blocked or unavailable
  • Kafka is unavailable/rebalancing but system cannot backpressure/deny incoming traffic
• Optional commit stage for at-least-once delivery semantics
• Implementation based on Chronicle Queue
A buffer of virtually unlimited size

Summary
• Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts & High Efficiency
• Akka Streams
  • Clear view of stream topology
  • Back-pressure & Kafka allow buffering of load bursts
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: Have the cake, and eat it too, with goodies like
  • PerpetualStream
  • PersistentBuffer
  • BroadcastBuffer
