
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and data-driven applications



Witnessing the rise of stream processing from the driving seat, we see Apache Flink® and associated technologies used for a wide variety of business applications: routing data through systems, serving as a backbone for real-time analytics on live data using SQL, detecting credit card fraud, and implementing complete end-to-end social networks. Such applications enable modern data-driven businesses where decisions and actions happen in real time, and they transform traditional businesses to become more data-driven. Observing the variety of these applications implemented using Flink, it becomes apparent that the traditional dividing line between analytics and operational applications is becoming increasingly blurry. Historically, operational applications were built using transactional databases, and analytics were done offline. In contrast, Flink’s state, checkpoints, and time management are the core building blocks both for operational applications with strong data consistency needs and for real-time analytics with correctness guarantees. With these shared building blocks, developers start building what is arguably a new class of data-driven applications: applications that are operational in that they serve live systems, and at the same time analytical in that they perform complex data analysis. Following application architectures like CQRS and using new features like Flink’s queryable state, streaming analytics and online applications move even closer to each other. In this talk, guided by real-world use cases, we present how the unique core concepts behind Flink simplify the development, deployment, and management of data-driven applications, and we conclude with a vision for the future of Flink and stream processing.

Published in: Data & Analytics


  1. Big thanks to everyone!
  2. The convergence of real-time analytics and event-driven applications. @StephanEwen, Flink Forward San Francisco, April 11, 2017
  3. 2016 was the year when streaming technologies became mainstream. 2017 is the year to realize the full spectrum of streaming applications.
  4. Some large-scale streaming applications
  5. Detecting fraud in real time
     • As fraudsters get better, need to update models without downtime
     • Live 24/7 service
     • Credit card transactions
     • Notifications and alerts
     • Evolving fraud models built by data scientists
  6. Athena X
     • SQL to define metrics
     • Thresholds and actions to trigger
     • Blends analytics and actions
     • Diagram labels: streams from Hadoop, Kafka, etc.; SQL, thresholds, actions; analytics; alerts; derived streams
  7. • Route events to Kafka, ES, Hive
     • Complex interaction sessions rules
     • Mix of stateless / small state / large state
     • Stream Processing as a Service
       • Launching, monitoring, scaling, updating
       • DSL to define jobs
  8. • Blink based on Flink
     • A core system in Alibaba Search
       • Machine learning, search, recommendations
       • A/B testing of search algorithms
       • Online feature updates to boost conversion rate
     • Alibaba is a major contributor to Flink
     • Contributing many changes back to open source
  9. Complete social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation)
  10. What can we learn from these?
      • All these applications run on Flink
      • Applications, not just analytics
        • Not just finding out what the data means, but acting on it at the same time
      • Workloads going beyond the traditional Hadoop realm
        • Hadoop is a possible deployment target, source, and sink
        • Container engines and other storage systems are increasingly popular with Flink
  11. So, what is data streaming?
      • First wave for streaming was the lambda architecture
        • Aid batch systems to be more real-time
      • Second wave was analytics (real time and lag-time)
        • Based on distributed collections, functions, and windows
      • The next wave is much broader: a new architecture for event-driven applications
  12. Event-driven applications
  13. Events, State, Time, and Snapshots: f(a,b). Event-driven function executed in a distributed fashion
  14. Events, State, Time, and Snapshots: f(a,b). Maintain fault-tolerant local state, similar to any normal application
  15. Events, State, Time, and Snapshots: f(a,b), wall clock, event time clock. Access and react to notions of time and progress; handle out-of-order events
  16. Events, State, Time, and Snapshots: f(a,b), wall clock, event time clock. Snapshot a point-in-time view for recovery, rollback, cloning, versioning, etc.
  17. Event-driven applications
      • Event-driven Applications: stateful, event-driven, event-time-aware processing (event sourcing, CQRS, …)
      • Stream Processing (streams, windows, …)
      • Batch Processing (data sets)
  18. The APIs
      • Process Function (events, state, time): Stateful Event-Driven Applications
      • DataStream API (streams, windows): Stream- & Batch Processing
      • Table API (dynamic tables), Stream SQL: Analytics
  19. Process Function
      class MyFunction extends ProcessFunction[MyEvent, Result] {

        // declare state to use in the program
        lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

        def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
          // work with event and state
          (event, state.value) match { … }

          out.collect(…)   // emit events
          state.update(…)  // modify state

          // schedule a timer callback
          ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
        }

        def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
          // handle callback when event-/processing-time instant is reached
        }
      }
  20. Data Stream API
      // read raw lines from Kafka
      val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09[String](…))

      // parse each line into an event
      val events: DataStream[Event] = lines.map(line => parse(line))

      // aggregate per sensor over 5-second windows
      val stats: DataStream[Statistic] = events
        .keyBy("sensor")
        .timeWindow(Time.seconds(5))
        .aggregate(new MyAggregationFunction())

      stats.addSink(new RollingSink(path))
  21. Table API & Stream SQL
  22. Streaming Architecture for Event-driven Applications
  23. Compute, State, and Storage
      • Classic tiered architecture: compute layer; database layer (application state + backup)
      • Streaming architecture: compute + application state; stream storage and snapshot storage (backup)
  24. Performance
      • Classic tiered architecture: synchronous reads/writes across tier boundary
      • Streaming architecture: all modifications are local; asynchronous writes of large blobs
  25. Consistency
      • Classic tiered architecture: distributed transactions; at scale typically at-most / at-least once
      • Streaming architecture: exactly once per state; snapshot consistency across states
  26. Scaling a Service
      • Classic tiered architecture: provision compute; separately provision additional database capacity
      • Streaming architecture: provision compute and state together
  27. Rolling out a new Service
      • Classic tiered architecture: provision a new database (or add capacity to an existing one)
      • Streaming architecture: provision compute and state together; simply occupies some additional backup space
  28. Time, Completeness, Out-of-order
      • Classic tiered architecture: ?
      • Streaming architecture: event time clocks define data completeness; event time timers handle actions for out-of-order data
  29. Repair External State (streaming architecture). Diagram labels: streams (let's say Kafka etc.); live application; external state (wrong results); backed up data (HDFS, S3, etc.)
  30. Repair External State (streaming architecture). Diagram labels: streams (let's say Kafka etc.); live application; backed up data (HDFS, S3, etc.); application on backup input; external state (overwrite with correct results)
  31. Repair External State (streaming architecture). Diagram labels: streams (let's say Kafka etc.); live application; backed up data (HDFS, S3, etc.); application on backup input; external state (overwrite with correct results). Each service doubles as a batch job!
  32. Streaming has outgrown the Hadoop stack. Event-driven applications and real-time analytics converge with Apache Flink. Event-driven applications become easier to manage, faster, and more powerful following a streaming architecture implemented with Flink.
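
The four building blocks named on slides 13 to 16 (events, state, time, and snapshots) can be illustrated with a small plain-Scala sketch. This is not Flink code and not the speaker's implementation: `Event`, `Snapshot`, `KeyedSum`, `advanceWatermark`, and the 500 ms timeout are hypothetical names and values chosen for illustration. It assumes a single process and an in-memory snapshot, whereas Flink distributes the function and writes snapshots to durable storage.

```scala
// Hypothetical plain-Scala sketch; none of these types are Flink APIs.
final case class Event(key: String, timestamp: Long, amount: Long)

// A point-in-time copy of all state held by the function.
final case class Snapshot(totals: Map[String, Long], timers: Map[String, Long])

final class KeyedSum(timeoutMs: Long) {
  // Local state: running total per key.
  private var totals = Map.empty[String, Long]
  // Pending event-time timer per key (the event time at which the total is emitted).
  private var timers = Map.empty[String, Long]

  // Events: update local state and (re)register an event-time timer.
  // Taking the max keeps the timer stable when events arrive out of order.
  def processElement(e: Event): Unit = {
    totals = totals.updated(e.key, totals.getOrElse(e.key, 0L) + e.amount)
    val due = e.timestamp + timeoutMs
    timers = timers.updated(e.key, math.max(due, timers.getOrElse(e.key, Long.MinValue)))
  }

  // Time: advance the event-time clock. Timers at or before the watermark fire,
  // emit (key, total), and clear that key's state.
  def advanceWatermark(watermark: Long): List[(String, Long)] = {
    val (due, pending) = timers.partition { case (_, t) => t <= watermark }
    val out = due.keys.toList.sorted.map(k => k -> totals(k))
    totals = totals -- due.keys
    timers = pending
    out
  }

  // Snapshots: the maps are immutable, so sharing them is a consistent copy.
  def snapshot(): Snapshot = Snapshot(totals, timers)

  // Recovery or rollback: replace current state with a previous snapshot.
  def restore(s: Snapshot): Unit = {
    totals = s.totals
    timers = s.timers
  }
}
```

In this sketch, restoring a snapshot and replaying the same watermarks reproduces the same output, which is the property the "Repair External State" slides rely on when an application is re-run on backed-up input.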