
Gearpump akka streams

The Gearpump Materializer enabling distributed akka streams


  1. Implementing an akka-streams materializer for big data: The Gearpump Materializer. Kam Kasravi
  2. Technical Presentation
     ● Familiarity with the akka-streams Flow and Graph DSLs
     ● Familiarity with big data and real-time streaming platforms
     ● Familiarity with Scala
     ● The effort between the akka-streams and Gearpump teams started late last year
     ● Resulted in a number of pull requests into akka-streams to enable different materializers
     ● Close to completion, with good support for the akka-streams DSL (all GraphStages)
     ● Fairly seamless to switch between local and distributed
  3. Who am I?
     ● Committer on Apache Gearpump (incubating) - http://gearpump.apache.org
     ● Architect on Trusted Analytics Platform (TAP) - http://trustedanalytics.org
     ● Lead or Architect across many companies and industries - NYSE, eBay, PayPal, Yahoo, ...
  4. What is Apache Gearpump?
     ● Accepted into the Apache incubator last March
     ● Similar to Apache Beam and Apache Flink (real-time message delivery)
     ● Heavily leverages the actor model and akka (more so than others)
     ● Unique features like the dynamic DAG
     ● Excellent runtime visualization tooling for cluster and application DAGs
     ● One of the best big data performance profiles (both throughput and latency)
  5. Agenda
     ● Why?
       ○ Why integrate akka-streams into a big data platform?
     ● Big Data platform evolving features
       ○ Functionality big data platforms are embracing
     ● Prerequisites needed for any Big Data platform
       ○ Minimal features a big data platform must have
     ● Big data platform integration challenges
       ○ What concepts do not map well within big data platforms?
     ● Object models: akka-streams, Gearpump
     ● Materialization
       ○ ActorMaterializer - materializing the module tree
       ○ GearpumpMaterializer - rewriting the module tree
  6. Why?
     ● Akka-streams has limitations inherent in a single JVM
       ○ Throughput and latency are key big data requirements that demand scaling beyond a single JVM
     ● The akka-streams DSL is a superset of other big data platform DSLs
       ○ It has a logical plan (declarative) that can be transformed into an execution plan (runtime)
     ● The akka-streams programming paradigm is declarative, composable, extensible*, stackable* and reusable*
     * Provides a level of extensibility and functionality beyond most big data platform DSLs
  7. Extensible
     ● Extend GraphStage
     ● Extend Source, Sink, Flow or BidiFlow
     ● All derive from Graph (see the sketch below)
     * Provides a level of extensibility and functionality beyond most big data platform DSLs
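As an illustration of the first bullet, here is a minimal sketch of a custom GraphStage, assuming a recent akka-streams API and an implicit materializer in scope; the Doubler stage and its names are hypothetical, not part of the talk.

    import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
    import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

    // Hypothetical Doubler stage: a Flow-shaped GraphStage that doubles each Int.
    class Doubler extends GraphStage[FlowShape[Int, Int]] {
      val in: Inlet[Int] = Inlet("Doubler.in")
      val out: Outlet[Int] = Outlet("Doubler.out")
      override val shape: FlowShape[Int, Int] = FlowShape(in, out)

      override def createLogic(attrs: Attributes): GraphStageLogic =
        new GraphStageLogic(shape) with InHandler with OutHandler {
          setHandlers(in, out, this)
          override def onPush(): Unit = push(out, grab(in) * 2) // emit the doubled element downstream
          override def onPull(): Unit = pull(in)                // propagate demand upstream
        }
    }

    // Usage: Source(1 to 3).via(new Doubler).runWith(Sink.foreach(println)) // prints 2, 4, 6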
  8. Stackable
     ● Another term for nestable or recursive. Reference to Kleisli (theoretical).
     ● Source, Sink, Flow or BidiFlow may contain their own topologies (see the sketch below)
     * Provides a level of extensibility and functionality beyond most big data platform DSLs
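To make the "own topologies" bullet concrete, here is a minimal sketch (the pairWithDouble name is hypothetical): externally a plain Flow[Int, (Int, Int), NotUsed], internally its own broadcast/zip topology.

    import akka.NotUsed
    import akka.stream.FlowShape
    import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Zip}

    // A reusable Flow whose internal topology fans out, transforms one branch, and zips back together.
    val pairWithDouble: Flow[Int, (Int, Int), NotUsed] =
      Flow.fromGraph(GraphDSL.create() { implicit b =>
        import GraphDSL.Implicits._
        val bcast = b.add(Broadcast[Int](2))
        val zip   = b.add(Zip[Int, Int]())
        bcast.out(0) ~> zip.in0
        bcast.out(1) ~> Flow[Int].map(_ * 2) ~> zip.in1
        FlowShape(bcast.in, zip.out)
      })

    // Usage: Source(1 to 3).via(pairWithDouble) emits (1,2), (2,4), (3,6)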
  9. Reusable
     ● Graph topologies can be attached anywhere (to any Graph)
     ● A recent akka-streams feature is dynamic attachment via hubs (see the sketch below)
     ● Hubs will take advantage of the Gearpump dynamic DAG within the GearpumpMaterializer
     * Provides a level of extensibility and functionality beyond most big data platform DSLs
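A brief sketch of the hubs mentioned above, assuming the MergeHub/BroadcastHub API from recent akka-streams releases and an implicit materializer in scope:

    import akka.stream.scaladsl.{BroadcastHub, Keep, MergeHub, Sink, Source}

    // Materialize the hub once; producers and consumers can then attach to it dynamically.
    val (toHub, fromHub) =
      MergeHub.source[String](perProducerBufferSize = 16)
        .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both)
        .run()

    Source.single("hello").runWith(toHub)   // a producer attaches after materialization
    fromHub.runWith(Sink.foreach(println))  // a consumer attaches after materialization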
  10. Big Data platform evolving features (1)
     ● Big data platforms are moving to consolidate disparate APIs
       ○ Too many APIs: Concord, Flink, Heron, Pulsar, Spark, Storm, Samza
       ○ A common DSL is also the approach being taken by Apache Beam
       ○ Analogy to SQL - a common grammar that different platforms execute
  11. Big Data platform evolving features (2)
     ● Big data platforms will increasingly require dynamic pipelines that are compositional and reusable
     ● Examples include:
       ○ Machine learning
       ○ IoT sensors
  12. Big Data platform evolving features (3)
     ● Machine learning use cases
       ○ Replace or update scoring models
       ○ Model ensembles
         ■ concept drift
         ■ data drift
  13. Big Data platform evolving features (4)
     ● IoT use cases
       ○ Bring new sensors online with no interruption
       ○ Change or update configuration parameters at remote sensors
  14. Prerequisites needed for any Big Data platform (1)
     1. Push and Pull: downstream must be able to pull; upstream must be able to push
     2. Backpressure: downstream must be able to backpressure all the way to the source (see the sketch below)
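A minimal sketch of requirement 2, assuming an implicit ActorSystem and materializer in scope as in the later examples: the throttled stage only signals demand at one element per second, and that demand propagates all the way back to the source.

    import scala.concurrent.duration._
    import akka.stream.ThrottleMode
    import akka.stream.scaladsl.{Sink, Source}

    // The source could push far faster, but backpressure limits it to the downstream rate.
    Source(1 to 100)
      .throttle(1, 1.second, 1, ThrottleMode.shaping)
      .runWith(Sink.foreach(println))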
  15. Prerequisites needed for any Big Data platform (2)
     3. Parallelization
     4. Asynchronous
     5. Bidirectional
     (A sketch of parallel, asynchronous processing follows.)
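A minimal sketch of requirements 3 and 4, assuming an implicit ActorSystem, materializer and ExecutionContext in scope: mapAsync runs up to four futures concurrently while keeping demand bounded.

    import scala.concurrent.Future
    import akka.stream.scaladsl.{Sink, Source}

    // Up to 4 elements are processed concurrently; results are emitted in the original order.
    Source(1 to 100)
      .mapAsync(parallelism = 4)(i => Future(i * 2))
      .runWith(Sink.foreach(println))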
  16. Big data platform integration challenges (1)
     A number of GraphStages have completion or cancellation semantics. Big data pipelines are often infinite streams and do not complete. Cancel is often viewed as a failure. (See the sketch below.)
     ● Balance[T]
     ● Completion[T]
     ● Merge[T]
     ● Split[T]
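A minimal sketch of the mismatch, assuming an implicit materializer in scope: completion is routine for a bounded local stream, while a big data pipeline is typically unbounded and has to be terminated explicitly.

    import akka.stream.scaladsl.{Sink, Source}

    // A bounded source completes on its own and the sink sees normal completion.
    Source(1 to 3).runWith(Sink.foreach(println))

    // An unbounded source never completes; it must be cut off explicitly,
    // which is the usual situation in big data pipelines.
    Source.repeat(1).take(3).runWith(Sink.foreach(println))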
  17. Big data platform integration challenges (2)
     A number of GraphStages have specific upstream and downstream ordering and timing directives. (See the sketch below.)
     ● Batch[T]
     ● Concat[T]
     ● Delay[T]
     ● DelayInitial[T]
     ● Interleave[T]
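For example, Concat and Interleave both impose ordering contracts across their upstreams (a sketch assuming an implicit materializer in scope), which is hard to honor once those upstreams run on different machines:

    import akka.stream.scaladsl.{Sink, Source}

    val a = Source(1 to 3)
    val b = Source(4 to 6)

    // Concat: b is only consumed after a completes => 1 2 3 4 5 6
    a.concat(b).runWith(Sink.foreach(println))

    // Interleave: strict alternation between upstreams => 1 4 2 5 3 6
    a.interleave(b, segmentSize = 1).runWith(Sink.foreach(println))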
  18. Big data platform integration challenges (3)
     The async attribute and fusing do not map cleanly when distributing GraphStage functionality across machines.
     ● Graph.async
     ● Fusing
  19. Graph.async
     ● Collapses multiple operations (GraphStageLogic) into one actor
     ● Useful in distributed scenarios where one may want actors within the same JVM or on the same machine
  20. Fusing
     ● Creates one or more islands delimited by async boundaries (see the sketch below)
     ● For a distributed scenario, no fusing should occur until the materializer can evaluate and optimize the execution plan
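A short sketch of the boundary in question, assuming an implicit materializer in scope: without .async the two map stages are fused and run in a single actor; the explicit boundary forces them into separate actors.

    import akka.stream.scaladsl.{Sink, Source}

    Source(1 to 10)
      .map(_ + 1).async   // asynchronous boundary: the stages on either side run in different actors
      .map(_ * 2)
      .runWith(Sink.foreach(println))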
  21. Object Models
     ● Akka-streams' GraphStage, Module, Shape
     ● Gearpump's Graph, Task, Partitioner
  22. Akka-streams Object Model
     ↪ The base type is Graph; the common base type is GraphStage
     ↪ A Graph contains a
       ↳ Module, which contains a
         ↳ Shape
     ↪ Only a RunnableGraph can be materialized
     ↪ A RunnableGraph needs at least one Source and one Sink
  23. Akka-streams Graph[S, M]
     ● Graph is parameterized by
       ○ Shape
       ○ Materialized value
     ● A Graph contains a Module, which contains a Shape
       ○ Module is where the runtime is constructed and manipulated
     ● Graph's first-level subtypes provide basic functionality (see the sketch below)
       ○ Source
       ○ Sink
       ○ Flow
       ○ BidiFlow
     (Diagram: Graph[S, M] with subtypes Source, Sink, Flow, BidiFlow; Graph contains Module, which contains Shape.)
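A small sketch of the two type parameters in practice: the shape is fixed by the subtype (Source, Sink, Flow) and the second parameter is the materialized value.

    import akka.{Done, NotUsed}
    import akka.stream.scaladsl.{Flow, Keep, RunnableGraph, Sink, Source}
    import scala.concurrent.Future

    val source: Source[Int, NotUsed]       = Source(1 to 5)                // Graph[SourceShape[Int], NotUsed]
    val flow:   Flow[Int, String, NotUsed] = Flow[Int].map(_.toString)     // Graph[FlowShape[Int, String], NotUsed]
    val sink:   Sink[String, Future[Done]] = Sink.foreach[String](println) // Graph[SinkShape[String], Future[Done]]

    // Only a closed graph can be handed to a materializer.
    val runnable: RunnableGraph[Future[Done]] = source.via(flow).toMat(sink)(Keep.right)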
  24. GraphStage[S <: Shape]
     (Diagram: GraphStage extends GraphStageWithMaterializedValue, which extends Graph; a GraphStage is wrapped by a GraphStageModule, which is a Module.)
  25. GraphStage[S <: Shape] subtypes (incomplete) ↳ Balance[T] ↳ Batch[In, Out] ↳ Broadcast[T] ↳ Collect[In, Out] ↳ Concat[T] ↳ DelayInitial[T] ↳ DropWhile[T] ↳ Expand[In, Out] ↳ FlattenMerge[T, M] ↳ Fold[In, Out] ↳ FoldAsync[T] ↳ FutureSource[T] ↳ GroupBy[T, K] ↳ Grouped[T] ↳ GroupedWithin[T] ↳ Interleave[T] ↳ Intersperse[T] ↳ LimitWeighted[T] ↳ Map[In, Out] ↳ MapAsync[In, Out] ↳ Merge[T] ↳ MergePreferred[T] ↳ MergeSorted[T] ↳ OrElse[T] ↳ Partition[T] ↳ PrefixAndTail[T] ↳ Recover[T] ↳ Scan[In, Out] ↳ SimpleLinearGraph[T] ↳ Sliding[T]
  26. What about Module?
     ● Module is a recursive structure containing a Set[Module]
     ● Module is a declarative data structure used as the AST
     ● Module is used to represent a graph of nodes and edges derived from the original GraphStages
     ● Module contains downstream and upstream ports (edges)
     ● Materializers walk the module tree to create and run instances of publishers and subscribers (see the simplified sketch below)
     ● Each publisher and subscriber is an actor (ActorGraphInterpreter)
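A deliberately simplified, hypothetical picture of that structure; this is not the actual akka-streams internal API, it only illustrates the recursive shape that materializers traverse.

    // Hypothetical illustration only (not the akka-streams Module trait itself).
    trait SimplifiedModule {
      def subModules: Set[SimplifiedModule] // a module contains a set of modules
      def inPorts: Set[String]              // upstream ports (edges)
      def outPorts: Set[String]             // downstream ports (edges)
    }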
  27. Gearpump Object Model
     ↪ Graph[Node, Edge] holds
       ↳ Tasks (Node)
       ↳ Partitioners (Edge)
     ↪ This is a Gearpump Graph, not to be confused with the akka-streams Graph.
  28. Gearpump Graph[N <: Task, E <: Partitioner]
     ● Graph is parameterized by
       ○ Node - must be a subtype of Task
       ○ Edge - must be a subtype of Partitioner
     (Diagram: Graph[N, E] holding a List[Task] and a List[Partitioner].)
  29. Task
     (Diagram: GraphTask extends Task.)
  30. GraphTask subtypes (incomplete) ↳ BalanceTask ↳ BatchTask[In, Out] ↳ BroadcastTask[T] ↳ CollectTask[In, Out] ↳ ConcatTask ↳ DelayInitialTask[T] ↳ DropWhileTask[T] ↳ ExpandTask[In, Out] ↳ FlattenMerge[T, M] ↳ FoldTask[In, Out] ↳ FutureSourceTask[T] ↳ GroupByTask[T, K] ↳ GroupedTask[T] ↳ GroupedWithinTask[T] ↳ InterleaveTask[T] ↳ IntersperseTask[T] ↳ LimitWeightedTask[T] ↳ MapTask[In, Out] ↳ MapAsyncTask[In, Out] ↳ MergeTask[T] ↳ OrElseTask[T] ↳ PartitionTask[T] ↳ PrefixAndTailTask[T] ↳ RecoverTask[T] ↳ ScanTask[In, Out] ↳ SlidingTask[T]
  31. Materializer Variations
     1. The AST (module tree) is matched for every module type (GearpumpMaterializer)
     2. The AST (module tree) is matched for certain module types
       ○ After distribution, a local ActorMaterializer is used for operations on that worker
       ○ The materializer works more as a distribution coordinator
  32. Example 1
     (Topology: Source ~> Broadcast ~> Flow / Flow ~> Merge ~> Sink)

     implicit val materializer = ActorMaterializer()
     val sinkActor = system.actorOf(Props(new SinkActor()))
     val source = Source(1 to 5)
     val sink = Sink.actorRef[Int](sinkActor, "COMPLETE")
     val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map { x =>
       println(s"processing broadcasted element : $x in flowA"); x
     }
     val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map { x =>
       println(s"processing broadcasted element : $x in flowB"); x
     }
     val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
       import GraphDSL.Implicits._
       val broadcast = b.add(Broadcast[Int](2))
       val merge = b.add(Merge[Int](2))
       source ~> broadcast
       broadcast ~> flowA ~> merge
       broadcast ~> flowB ~> merge
       merge ~> sink
       ClosedShape
     })
     graph.run()
  33. Example 1
     The same code as slide 32, shown against the GraphStages it creates: source, broadcast, flowA, flowB, merge, sink.
     The SinkActor used by Sink.actorRef:

     class SinkActor extends Actor {
       def receive: Receive = {
         case any: Any => println(s"Confirm received: $any")
       }
     }
  34. Example 1
     (Diagram: the GraphStages - Source, Broadcast, Flow, Flow, Merge, Sink - mapped to the module tree: GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), ActorRefSink.)
  35. Example 1
     The same code as slide 32, annotated with the GraphStages it creates: source, broadcast, flowA, flowB, merge, sink.
  36. Example 1 - ActorMaterializer output
     processing broadcasted element : 1 in flowA
     processing broadcasted element : 1 in flowB
     processing broadcasted element : 2 in flowA
     Confirm received: 1
     Confirm received: 1
     processing broadcasted element : 2 in flowB
     Confirm received: 2
     Confirm received: 2
     processing broadcasted element : 3 in flowA
     processing broadcasted element : 3 in flowB
     processing broadcasted element : 4 in flowA
     processing broadcasted element : 4 in flowB
     Confirm received: 3
     Confirm received: 3
     processing broadcasted element : 5 in flowA
     processing broadcasted element : 5 in flowB
     Confirm received: 4
     Confirm received: 4
     Confirm received: 5
     Confirm received: 5
     Confirm received: COMPLETE
  37. Example 1
     The same code and GraphStages as slide 32, with only the materializer changed:

     implicit val materializer = GearpumpMaterializer()
  38. Example 1 - GearpumpMaterializer output
     processing broadcasted element : 1 in flowA
     processing broadcasted element : 1 in flowB
     processing broadcasted element : 2 in flowB
     processing broadcasted element : 2 in flowA
     processing broadcasted element : 3 in flowB
     processing broadcasted element : 3 in flowA
     processing broadcasted element : 4 in flowB
     processing broadcasted element : 4 in flowA
     processing broadcasted element : 5 in flowB
     Confirm received: 1
     processing broadcasted element : 5 in flowA
     Confirm received: 1
     Confirm received: 2
     Confirm received: 2
     Confirm received: 3
     Confirm received: 3
     Confirm received: 4
     Confirm received: 4
     Confirm received: 5
     Confirm received: 5
  39. Demo
     (Module tree: GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), ActorRefSink.)
  40. ActorMaterializer
     1. Traverses the module tree (the same module tree as the previous slide).
  41. ActorMaterializer
     2. Builds a runtime graph of BoundaryPublishers and BoundarySubscribers (Reactive Streams API).
     3. Each Publisher or Subscriber contains an instance of GraphStageLogic specific to that GraphStage.
     4. Each Publisher or Subscriber also contains an instance of ActorGraphInterpreter - an Actor that manages the message flow using GraphStageLogic.
  42. GearpumpMaterializer
     1. Rewrites the module tree into 'local' and 'remote' Gearpump Graphs.
     (Diagram: the module tree - GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), ActorRefSink.)
  43. GearpumpMaterializer
     2. The choice of 'local' vs. 'remote' is determined by a Strategy. The default Strategy is to put Source and Sink types in the local graph.
     (Diagram: the same module tree as the previous slide.)
  44. GearpumpMaterializer
     3. Inserts BridgeModules into both graphs.
     (Diagram: SourceBridgeModule and SinkBridgeModule inserted into both the local graph - GraphStageModule(stage=SingleSource), ActorRefSink - and the remote graph - GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge).)
  45. GearpumpMaterializer
     4. The local graph is passed to a LocalGraphMaterializer, a variant (subtype) of ActorMaterializer.
     (Diagram: local graph - GraphStageModule(stage=SingleSource), SourceBridgeModule, SinkBridgeModule, ActorRefSink.)
  46. GearpumpMaterializer
     5. Converts the remote graph's Modules into Tasks.
     (Diagram: remote graph - SourceBridgeTask, StatefulMapConcatTask, BroadcastTask, TransformTask, MergeTask, SinkBridgeTask.)
  47. GearpumpMaterializer
     6. Sends this Graph to the Gearpump master.
     (Diagram: the same task graph as the previous slide.)
  48. GearpumpMaterializer
     7. Materialization is controlled at the BridgeTasks.
     (Diagram: the same task graph as the previous slide.)
  49. Example 2
     No local graph. More typical of distributed apps.

     implicit val materializer = GearpumpMaterializer()
     val sink = GearSink.to(new LoggerSink[String])
     val sourceData = new CollectionDataSource(
       List("red hat", "yellow sweater", "blue jack", "red apple", "green plant", "blue sky"))
     val source = GearSource.from[String](sourceData)
     source.filter(_.startsWith("red")).map("I want to order item: " + _).runWith(sink)
  50. Example 3
     A more complex Graph with loops.

     implicit val materializer = GearpumpMaterializer()
     RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
       import GraphDSL.Implicits._
       val A = builder.add(Source.single(0)).out
       val B = builder.add(Broadcast[Int](2))
       val C = builder.add(Merge[Int](2))
       val D = builder.add(Flow[Int].map(_ + 1))
       val E = builder.add(Balance[Int](2))
       val F = builder.add(Merge[Int](2))
       val G = builder.add(Sink.foreach(println)).in

       C <~ F
       A ~> B ~> C ~> F
       B ~> D ~> E ~> F
       E ~> G

       ClosedShape
     }).run()
  51. Summary
     ● Akka-streams provides a compelling programming model that enables declarative pipeline reuse and extensibility.
     ● Akka-streams allows different materializers to control and materialize different parts of the module tree.
     ● It is possible to provide a seamless (or nearly seamless) conversion of akka-streams to run in a distributed setting by merely replacing the ActorMaterializer with the GearpumpMaterializer.
     ● Alternative distributed materializers can be implemented using a similar approach.
     ● Distributed akka-streams via Apache Gearpump will be available in the next release of Apache Gearpump (0.8.2) or will be made available within an akka-specific repo.
  52. Thank you
     Twitter: @ApacheGearpump @kkasravi
