Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Back-Pressure in
Action
Intro & Agenda Crawler Intro & Problem
Statements
Crawler Architecture
Infrastructure: Akka Streams,
Kafka, etc.
The Goodi...
Crawl
Jobs
Job DB
Validate
URL
Cache
Downloa
d
Process
URLs
URLs
Timestamps
High-Level View
Requirements Ever-expanding # of URLs
Can’t crawl all URLs at once
Control over concurrent web GETs
Efficient resource usa...
Sizing the Crawl Job
Let:
i = Number of crawl URLs in a job
n = Average number of links per page
d = The crawl depth
(how ...
The Reactive Manifesto
Responsive
Message Driven
Elastic Resilient
Why Does it Matter?
Respond in a deterministic, timely manner
Stays responsive in the face of failure – even cascading fai...
The Right Ingredients
• Kafka
• Huge persistent buffer for the bursts
• Load distribution to very large number of
processi...
Crawl
Jobs
Job DB
Validate
URL
Cache
Downloa
d
Process
URLs
URLs
Timestamps
Adding Kafka & Akka Streams
URLs
Akka Streams
Akka Streams,
what???
High performance, pure async,
stream processing
Conforms to reactive streams
Simple, yet powerful Gr...
Crawl Stream
Actual Stream Declaration in Code
prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlo...
Resulting Characteristics
Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High latency...
Back-Pressure
0
20000
40000
60000
80000
100000
120000
0 100 200 300 400 500 600 700
Queue Size
Time (seconds)
0
200
400
UR...
Challenges
Training
• Developers not used to E2E stream
definitions
• More familiar with deeply nested function
calls
Matu...
What it would
have been…
Bloated, ineffective concurrency
control
Lack of well-thought-out and visible
processing pipeline...
Bottom Line
Standardized Reactive Platform
Efficiency & Resilience meets Standardization
• Monitoring
• Need to collect metrics, consistently
• Logging
• Correlation...
squbs is not… A framework by its own
A programming model – use Akka
Take all or none –
Components/patterns can mostly be
u...
squbs
Akka for large
scale deployments
Bootstrap
Lifecycle management
Loosely-coupled module system
Integration hooks for ...
squbs
Akka for large
scale deployments
JSON console
HttpClient with pluggable resolver and
monitoring/logging hooks
Test t...
PerpetualStream
• Provides a convenience trait to help
write streams controlled by system
lifecycle
• Minimal/no message l...
PersistentBuffer/BroadcastBuffer
• Data & indexes in rotating memory-mapped files
• Off-heap rotating file buffer – very l...
Summary
• Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts
& High Efficiency
• Akka Streams
• Clear v...
Q&A – Feedback Appreciated
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Upcoming SlideShare
Loading in …5
×

Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

509 views

Published on

Akka Streams and its amazing handling of stream back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially use cases where the amount of work increases as you process make you really value the back-pressure.

This talk takes a sample web crawler use case where each processing pass expands to a larger and larger workload to process, and discusses how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.

In addition, we will also provide some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.

Published in: Software
  • Be the first to comment

  • Be the first to like this

Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

  1. 1. Back-Pressure in Action
  2. 2. Intro & Agenda Crawler Intro & Problem Statements Crawler Architecture Infrastructure: Akka Streams, Kafka, etc. The Goodies
  3. 3. Crawl Jobs Job DB Validate URL Cache Downloa d Process URLs URLs Timestamps High-Level View
  4. 4. Requirements Ever-expanding # of URLs Can’t crawl all URLs at once Control over concurrent web GETs Efficient resource usage Resilient under high burst Scales horizontally & vertically
  5. 5. Sizing the Crawl Job Let: i = Number of crawl URLs in a job n = Average number of links per page d = The crawl depth (how many layers to follow links) u = The max number of URLs to process Then: u = ind 1.00E+00 1.00E+01 1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 0 2 4 6 8 10 12 totalURLs vs depth depth (initialURLs = 1, outLinks = 5) 1.00E+00 1.00E+01 1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08 1.00E+09 1.00E+10 1.00E+11 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 totalURLs vs initialURLs initialURLs (depth = 5, outLinks = 5)
  6. 6. The Reactive Manifesto Responsive Message Driven Elastic Resilient
  7. 7. Why Does it Matter? Respond in a deterministic, timely manner Stays responsive in the face of failure – even cascading failures Stays responsive under workload spikes Basic building block for responsive, resilient, and elastic systems Responsive Resilient Elastic Message Driven
  8. 8. The Right Ingredients • Kafka • Huge persistent buffer for the bursts • Load distribution to very large number of processing nodes • Enable horizontal scalability • Akka streams • High performance, highly efficient processing pipeline • Resilient with end-to-end back-pressure • Fully asynchronous – utilizes mapAsyncUnordered with Async HTTP client • Async HTTP client • Non-blocking and consumes no threads in waiting • Integrates with Akka Streams for a high parallelism, low resource solution Efficient Resilient Scale Akka Stream Async HTTP Reactive Kafka
  9. 9. Crawl Jobs Job DB Validate URL Cache Downloa d Process URLs URLs Timestamps Adding Kafka & Akka Streams URLs Akka Streams
  10. 10. Akka Streams, what??? High performance, pure async, stream processing Conforms to reactive streams Simple, yet powerful GraphDSL allows clear stream topology declaration Central point to understand processing pipeline
  11. 11. Crawl Stream Actual Stream Declaration in Code prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink bCast ~> dataSinkFlow ~> kafkaDataSink bCast ~> hdfsDataSink bCast ~> graphFlow ~> merge ~> graphSink bCast0 ~> maxPage ~> merge bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge bCastRetry ~> errorSink Prioritized Source Crawl Result MaxPageReached Retry OutLinks Data Graph CheckFail CheckErr OutLinks Sink Kafka Data Sink HDFS Data Sink Graph Sink Error Sink
  12. 12. Resulting Characteristics Efficient • Low thread count, controlled by Akka and pure non-blocking async HTTP • High latency URLs do not block low latency URLs using MapAsyncUnordered • Well-controlled download concurrency using MapAsyncUnordered • Thread per concurrent crawl job Resilient • Processes only what can be processed – no resource overload • Kafka as short-term, persistent queue Scale • Kafka feeds next batch of URLs to available node cluster • Pull model – only processes that have capacity will get the load • Kafka distributes work to large number of processing nodes in cluster
  13. 13. Back-Pressure 0 20000 40000 60000 80000 100000 120000 0 100 200 300 400 500 600 700 Queue Size Time (seconds) 0 200 400 URLs/sec Time (seconds) initialURLs : 100 parallelism : 1000 processTime : 1 – 5 s outLinks : 0 - 10 depth : 5 totalCrawled : 312500
  14. 14. Challenges Training • Developers not used to E2E stream definitions • More familiar with deeply nested function calls Maturity of Infrastructure • Kafka 0.9 use fetch as heartbeat • Slow nodes cause timeout & rebalance • Solved in 0.10
  15. 15. What it would have been… Bloated, ineffective concurrency control Lack of well-thought-out and visible processing pipeline Clumsy code, hard to manage & understand Low training cost, high project TCO Dev / Support / Maintenance
  16. 16. Bottom Line
  17. 17. Standardized Reactive Platform
  18. 18. Efficiency & Resilience meets Standardization • Monitoring • Need to collect metrics, consistently • Logging • Correlation across services • Uniformity in logs • Security • Need to apply standard security configuration • Environment Resolution • Staging, production, etc. Consistency in the face of Heterogeneity
  19. 19. squbs is not… A framework by its own A programming model – use Akka Take all or none – Components/patterns can mostly be used independently
  20. 20. squbs Akka for large scale deployments Bootstrap Lifecycle management Loosely-coupled module system Integration hooks for logging, monitoring, ops integration
  21. 21. squbs Akka for large scale deployments JSON console HttpClient with pluggable resolver and monitoring/logging hooks Test tools and interfaces Goodies: - Activators for Scala & Java - Programming patterns and helpers for Akka and Akka Stream Use cases…, and growing
  22. 22. PerpetualStream • Provides a convenience trait to help write streams controlled by system lifecycle • Minimal/no message losses • Register PerpetualStream to make stream start/stop • Provides customization hooks – especially for how to stop the stream • Provides killSwitch (from Akka) to be embedded into stream • Implementers - just provide your stream! A non-stop stream; starts and stops with the system class MyStream extends PerpetualStream[Future[Int]] { def generator = Iterator.iterate(0) { p => if (p == Int.MaxValue) 0 else p + 1 } val source = Source.fromIterator(generator _) val ignoreSink = Sink.ignore[Int] override def streamGraph = RunnableGraph.fromGraph( GraphDSL.create(ignoreSink) { implicit builder => sink => import GraphDSL.Implicits._ source ~> killSwitch.flow[Int] ~> sink ClosedShape }) }
  23. 23. PersistentBuffer/BroadcastBuffer • Data & indexes in rotating memory-mapped files • Off-heap rotating file buffer – very large buffers • Restarts gracefully with no or minimal message loss • Not as durable as a remote data store, but much faster • Does not back-pressure upstream beyond data/index writes • Similar usage to Buffer and Broadcast • BroadcastBuffer – a FanOutShape decouples each output port making each downstream independent • Useful if downstream stage blocked or unavailable • Kafka is unavailable/rebalancing but system cannot backpressure/deny incoming traffic • Optional commit stage for at-least-once delivery semantics • Implementation based on Chronicle Queue A buffer of virtually unlimited size
  24. 24. Summary • Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts & High Efficiency • Akka Streams • Clear view of stream topology • Back-pressure & Kafka allows buffering load bursts • Standardization • Walk like a duck, quack like a duck, and manage it like a duck • squbs: Have the cake, and eat it too, with goodies like • PerpetualStream • PersistentBuffer • BroadcastBuffer
  25. 25. Q&A – Feedback Appreciated

×