
Big data reactive streams and OSGi - M Rulli


OSGi Community Event 2017 Presentation by Matteo Rulli [FlairBit]

One of the basic requirements for enabling big-data analytics is a rational and effective approach to data ingestion. In long-running projects the need arises to evolve the domain model, and this potentially affects data quality. As a consequence, the concept of versioning is crucial to keep data-centric systems consistent: the importance of service dynamicity and good modularity support in a sound data ingestion workflow can hardly be overestimated.

This talk demonstrates how to combine OSGi Declarative Services and OSGi's robust versioning support to enable complex data ingestion use cases such as serialization upcasting, segregation of domain and data models, and event versioning. Both Akka and Cassandra are offered as OSGi services to materialize big-data processing workflows with no pain.


  1. 1. BIG-DATA REACTIVE STREAMS AND OSGI. Matteo Rulli, FlairBit (MATTEO.RULLI@FLAIRBIT.IO), 24/10/2017
  2. 2. ARCHITECTURE
  3. 3. [Architecture diagram]
  4. 4. PAIN POINTS
     - Ensure data quality
     - Guarantee fast data
     - Detach the data model from the domain model
     - Seamless and elastic scaling
     - Geolocalization
     - Data privacy
     - Authorization
  5. 5. EVENTS ARE IMMUTABLE ... except for:
     - No longer relevant fields
     - Misnamed keys
     - Missing information
     - Accidental privacy / security leaks
     - Requirement changes
  6. 6. THE SOURCE OF TRUTH
     We use Apache Cassandra as our unique source of truth.
     "The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data." - Apache Cassandra Homepage
  7. 7. DON'T PUT ALL YOUR EGGS IN ONE BUCKET
     - Spread data evenly around the cluster
     - Keep maximum partition size between 10MB and 100MB
     - Minimize the number of partitions read
     Speaking about timeseries:
     - Partition by device-id, measure-id and time buckets
     - Dynamic partitioning depending on data rates
     - Limit query WHERE clauses to these dimensions
     A possible table layout following this advice is sketched below.
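To make the partitioning advice concrete, here is a minimal sketch of a possible table layout, assuming the DataStax Java driver and hypothetical keyspace/column names (the actual schema is not shown in the slides):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TimeseriesSchema {
    public static void main(String[] args) {
        // Assumes a running node on localhost and an existing "iot" keyspace.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Composite partition key (device_id, measure_id, bucket): data is
            // spread evenly across the cluster and each partition is bounded
            // to one time bucket, keeping its size in the 10MB-100MB range.
            session.execute(
                "CREATE TABLE IF NOT EXISTS iot.events ("
                + " device_id  text,"
                + " measure_id text,"
                + " bucket     int,"          // e.g. months since the epoch
                + " ts         timestamp,"
                + " value      double,"
                + " PRIMARY KEY ((device_id, measure_id, bucket), ts)"
                + ") WITH CLUSTERING ORDER BY (ts DESC)");
        }
    }
}
```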
  8. 8. HOW DO I QUERY THIS THING?
     Typical problem: get the last known status for all measures pertaining to a device. Different measures from the same device can live in different time buckets, so no single partition query covers them all. A possible per-bucket query is sketched below.
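Under the table sketched above, the last known value of a single measure can be fetched one bucket at a time with a LIMIT 1 query, scanning buckets from newest to oldest. A hedged sketch (keyspace, column and method names are assumptions):

```java
import java.util.Optional;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class LastKnownValue {

    // Thanks to CLUSTERING ORDER BY (ts DESC), LIMIT 1 returns the most
    // recent row of a partition; we stop at the first non-empty bucket.
    static Optional<Row> lastValue(Session session, String deviceId,
                                   String measureId, int newestBucket) {
        PreparedStatement stmt = session.prepare(
            "SELECT ts, value FROM iot.events"
            + " WHERE device_id = ? AND measure_id = ? AND bucket = ? LIMIT 1");
        for (int bucket = newestBucket; bucket >= 0; bucket--) {
            Row row = session.execute(stmt.bind(deviceId, measureId, bucket)).one();
            if (row != null) {
                return Optional.of(row);
            }
        }
        return Optional.empty();
    }
}
```

Done naively like this, the scan is sequential per measure; the next slides turn it into an Akka Streams graph that runs the per-measure scans as parallel flows.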
  9. 9. STREAMS TO THE RESCUE
  10. 10. [diagram]
  11. 11. Define the source:

```java
// Emit time buckets, newest first.
TimeBucket bucketing = TimeBucket.monthly()
    .startFrom(Instant.EPOCH).endAt(Instant.now())
    .withOrdering(BucketOrdering.DESC);

Source<Integer, NotUsed> source = Source.fromIterator(() -> bucketing);
```
  12. 12. Define the queries:

```java
List<Flow<Integer, Event, NotUsed>> flows = new ArrayList<>();
for (final String capabilityId : capabilities) {
    // One flow per measure: query each incoming bucket until a row is
    // found, then complete after the first hit (take(1)).
    Flow<Integer, Event, NotUsed> flow = Flow.of(Integer.class)
        .map(bucket -> {
            BoundStatement bound = // ... bind the statement ...
            return Optional.ofNullable(
                entityMapper.map(session.execute(bound)).one());
        })
        .filter(Optional::isPresent)
        .map(Optional::get)
        .take(1);
    flows.add(flow);
}
```
  13. 13. Define the sink:

```java
// Collect every event into a List once the stream completes.
Sink<Event, CompletionStage<List<Event>>> sink = Sink.seq();
```
  14. 14. Wire the graph:

```java
RunnableGraph<CompletionStage<List<Event>>> graph =
    RunnableGraph.fromGraph(GraphDSL.create(sink, (builder, out) -> {
        // fanOutSize matches the number of query flows
        UniformFanOutShape<Integer, Integer> bcast =
            builder.add(Broadcast.create(fanOutSize));
        UniformFanInShape<Event, Event> merge =
            builder.add(Merge.create(fanOutSize));
        Outlet<Integer> src = builder.add(source).out();

        Iterator<Flow<Integer, Event, NotUsed>> fi = flows.iterator();

        // initial wiring with the first query flow
        builder.from(src)
               .viaFanOut(bcast)
               .via(builder.add(fi.next()))
               .viaFanIn(merge)
               .to(out);

        // add remaining query flows
        while (fi.hasNext())
            builder.from(bcast).via(builder.add(fi.next())).toFanIn(merge);

        return ClosedShape.getInstance();
    }));
```
  15. 15. Get the result:

```java
List<Event> status = graph.run(materializer).toCompletableFuture().get();
```
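Blocking with get() is fine for a demo; as an aside, the same CompletionStage can also be consumed asynchronously (a sketch continuing the snippet above, with an illustrative consumer):

```java
// Non-blocking variant: react to the materialized value when it completes.
graph.run(materializer)
     .thenAccept(events -> System.out.println("Last known status: " + events));
```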
  16. 16. STREAMS EXPRESSIVITY
     Reactive streams typically offer a rich set of built-in processing stages that help process data in a very concise way (a small sketch follows this list):
     - fold/unfold
     - zip/unzip
     - combine
     - map/concat/merge
     - take(while)/drop(while)/head/last
     - foreach/reduce/collect
     - grouped
     - sliding
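As a minimal self-contained illustration of a few of these stages chained together (Akka 2.5-era Java API, matching the slides; the numbers are arbitrary):

```java
import java.util.Arrays;
import java.util.concurrent.CompletionStage;

import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.javadsl.Source;

public class StagesDemo {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorMaterializer materializer = ActorMaterializer.create(system);

        // map transforms, grouped batches, fold aggregates.
        CompletionStage<Integer> sum =
            Source.from(Arrays.asList(1, 2, 3, 4, 5, 6))
                  .map(n -> n * n)                         // square each element
                  .grouped(2)                              // batch into pairs
                  .map(pair -> pair.get(0) + pair.get(1))  // sum each pair
                  .runFold(0, (acc, n) -> acc + n, materializer); // grand total

        sum.thenAccept(total -> {
            System.out.println("total = " + total);        // prints 91
            system.terminate();
        });
    }
}
```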
  17. 17. DOES THIS STREAM THING WORK ON OSGI?

```java
import java.util.Hashtable;

import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

import akka.actor.ActorSystem;
import akka.osgi.ActorSystemActivator;
import akka.stream.ActorMaterializer;

public class TimeseriesActorSystem extends ActorSystemActivator {

    private ServiceRegistration<ActorMaterializer> registeredMaterializer;

    @Override
    public void configure(BundleContext context, ActorSystem system) {
        registerService(context, system);
        ActorMaterializer materializer = ActorMaterializer.create(system);

        // Register materializer for timeseries actor system
        Hashtable<String, String> props = new Hashtable<>();
        props.put("name", "timeseries-materializer");
        props.put("objectClass", ActorMaterializer.class.getSimpleName());
        registeredMaterializer =
            context.registerService(ActorMaterializer.class, materializer, props);
    }

    @Override
    public String getActorSystemName(BundleContext context) {
        return "timeseries-actorsys";
    }

    @Override
    public void stop(BundleContext context) {
        registeredMaterializer.unregister();
        super.stop(context);
    }
}
```
  18. 18. And now you can use it where you need it:

```java
@Reference(cardinality = ReferenceCardinality.MANDATORY,
           target = "(name=timeseries-materializer)")
public void setMaterializer(ActorMaterializer materializer) {
    this.materializer = materializer;
}
```
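To round this out, a minimal sketch of what such a consuming Declarative Services component could look like end to end; the component class and the stream it runs are illustrative assumptions, not taken from the slides:

```java
import java.util.concurrent.CompletionStage;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.osgi.service.component.annotations.ReferenceCardinality;

import akka.Done;
import akka.stream.ActorMaterializer;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

@Component(service = StatusPrinter.class)
public class StatusPrinter {

    private volatile ActorMaterializer materializer;

    // SCR injects the materializer registered by TimeseriesActorSystem;
    // the target filter selects it by the "name" service property.
    @Reference(cardinality = ReferenceCardinality.MANDATORY,
               target = "(name=timeseries-materializer)")
    public void setMaterializer(ActorMaterializer materializer) {
        this.materializer = materializer;
    }

    // Any stream in this bundle can now run on the shared actor system.
    public CompletionStage<Done> printRange() {
        return Source.range(1, 10)
                     .runWith(Sink.foreach(System.out::println), materializer);
    }
}
```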
  19. 19. IS THIS REACTIVE?
     Good news, no more OutOfMemoryErrors:
     "The hard problem of propagating and reacting to back-pressure has been incorporated in the design of Akka Streams already." - The Akka Streams docs
  20. 20. What is back pressure, by the way?
     "A means of flow-control, a way for consumers of data to notify a producer about their current availability, effectively slowing down the upstream producer to match their consumption speeds. In the context of Akka Streams back-pressure is always understood as non-blocking and asynchronous." - The Akka docs
  21. 21. And this is incorporated in the streams implementation:
     "When we talk about asynchronous, non-blocking backpressure we mean that the processing stages available in Akka Streams will not use blocking calls but asynchronous message passing to exchange messages between each other, and they will use asynchronous means to slow down a fast producer, without blocking its thread." - The Akka Streams docs
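As an illustration of backpressure at work, a minimal sketch with the Akka 2.5-era Java API: the range below could emit a million elements, but demand only flows upstream as fast as the throttled stage consumes, so memory stays bounded (the rates are arbitrary):

```java
import java.util.concurrent.TimeUnit;

import scala.concurrent.duration.FiniteDuration;

import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.ThrottleMode;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class BackpressureDemo {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("backpressure-demo");
        ActorMaterializer materializer = ActorMaterializer.create(system);

        // The source never runs more than a bounded buffer ahead of the
        // throttle: demand propagates upstream asynchronously, without
        // blocking the producer's thread.
        Source.range(1, 1_000_000)
              .throttle(10, FiniteDuration.create(1, TimeUnit.SECONDS), 10,
                        ThrottleMode.shaping())
              .runWith(Sink.foreach(System.out::println), materializer);
    }
}
```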
  22. 22. EVENTS UPCASTING
  23. 23. [Events upcasting diagram]
  24. 24. EVENTS UPCASTING WITH ADAPTERS ON OSGI
     - Several (versions of) event adapters can be registered to take care of event upcasting when you need to adapt older events to a newer model
     - Adding a new adapter is just a matter of installing a new bundle
     - Adapter versions can be matched with event versions to adapt timeseries records on the fly
     A possible shape for such a versioned adapter service is sketched below.
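A sketch only: the EventAdapter contract, the service properties, and the mapping body are all assumptions, since the slides do not show the actual interfaces (Event is the type used in the stream snippets above):

```java
import org.osgi.service.component.annotations.Component;

// Hypothetical upcasting contract: adapt an event from one schema version
// to the next.
interface EventAdapter {
    Event adapt(Event oldEvent);
}

// Shipped as its own bundle; SCR registers it with version properties so
// consumers can match adapters against the version stored with each record.
@Component(service = EventAdapter.class,
           property = { "event.version=1", "adapts.to=2" })
public class EventAdapterV1toV2 implements EventAdapter {
    @Override
    public Event adapt(Event v1) {
        // ... rename misnamed keys, drop no-longer-relevant fields, fill
        // defaults for missing information (see the "events are immutable"
        // slide) and return the event in its v2 shape.
        return v1; // placeholder: the real mapping depends on the domain model
    }
}
```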
  25. 25. CONCLUSIONS
     - Choose your scalable and high-performance source of truth
     - Make it queryable in an elegant and concise way with the powerful expressiveness of Akka Streams or RxJava
     - Be reactive and leverage backpressure
     - Use the source of truth to create your data marts and materialized views
     - OSGi can help ensure dynamicity and integrated semantic versioning for event upcasting
  26. 26. THANKS
