
Streaming analytics better than batch – when and why by Dawid Wysakowicz and Adam Kawa at Big Data Spain 2017

While many problems can be solved in batch, the stream processing approach currently offers more benefits. It is not only about sub-second latency at scale, but mainly about the ability to express accurate analytics with little effort, something that is hard or usually ignored with older batch technologies like Pig, Scalding, or Spark, or even with established stream processors like Storm or Spark Streaming.

https://www.bigdataspain.org/2017/talk/streaming-analytics-better-than-batch-when-and-why

Big Data Spain 2017
16th - 17th November Kinépolis Madrid


  1. 1. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Streaming analytics better than batch - when and why? Adam Kawa - Dawid Wysakowicz - Krzysztof Zarzycki
  2. 2. Have you ever built cool Big Data pipelines?
  4. 4. Example Use-Case ■ Can be done in batch and real-time ■ User session analytics at Spotify ● Simple stats ■ Duration, number of songs, skips, searches etc. ● Advanced analytics ■ Mood, physical activity, real-time content, ads
  5. 5. Example Output 1. Dashboards: How long do users listen to a new edition of Discover Weekly?
  6. 6. Example Output 1. Dashboards: How long do users listen to a new edition of Discover Weekly? 2. Alerts: Australian users listen to Discover Weekly for too short a time!
  7. 7. Example Output 1. Dashboards: How long do users listen to a new edition of Discover Weekly? 2. Alerts: Australian users listen to Discover Weekly for too short a time! 3. Content: Recommend songs and ads based on current activity.
  8. 8. 1st - Batch Architecture (diagram labels: 1h, 1h, 1h, 1h - 1d, 1h; User Events, User Sessions)
  9. 9. 1st - Batch Architecture (diagram labels: 1h, 1h, 1h, 1d, 1h; User Events, User Sessions)
  10. 10. The More Moving Parts … ⬇ The steeper the learning curve ⬇ The more glue code ⬇ The larger the administrative effort ⬇ The more error-prone the solution
  11. 11. Long Waiting Time Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009 and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
  12. 12. 2nd - Micro-Batch Architecture 1m - 1h
  13. 13. No Built-In Session Windows (diagram: events on a timeline cut by the fixed windows [10:00 - 11:00) and [11:00 - 12:00))
  15. 15. Late Data … (diagram: events plotted on Event Time and Processing Time axes; 14:55 - 16:35)
  16. 16. ... Included in Current Batch (diagram: events plotted on Event Time and Processing Time axes; 14:55 - 16:35, 16:50 - …)
  17. 17. Out-Of-Order Data … (diagram: events plotted on Event Time and Processing Time axes)
  20. 20. ... Breaks Correctness (diagram: events plotted on Event Time and Processing Time axes)
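The effect described on the late-data and out-of-order slides can be illustrated with a small sketch in plain Scala (no streaming engine involved). The `Play` type, its field names, and the sample timestamps are assumptions made up for this illustration; they are not taken from the talk. The same three events are counted per hourly window twice, once by processing time and once by event time.

```scala
// Hypothetical event: when a song was played vs. when the record reached the pipeline.
// Times are minutes since midnight.
final case class Play(userId: String, eventTimeMin: Int, processingTimeMin: Int)

object WindowAssignment {
  // Start (in minutes since midnight) of the hourly window that contains `minute`.
  def hourlyWindowStart(minute: Int): Int = (minute / 60) * 60

  // Count plays per hourly window, using whichever timestamp the caller selects.
  def countPerHour(plays: Seq[Play], timestamp: Play => Int): Map[Int, Int] =
    plays
      .groupBy(p => hourlyWindowStart(timestamp(p)))
      .map { case (windowStart, ps) => windowStart -> ps.size }

  // Made-up sample: the second play happened at 10:55 but arrived at 11:20.
  val sample: Seq[Play] = Seq(
    Play("u1", 10 * 60 + 50, 10 * 60 + 51),
    Play("u1", 10 * 60 + 55, 11 * 60 + 20),
    Play("u1", 11 * 60 + 5, 11 * 60 + 6)
  )
}
```

With this sample, `WindowAssignment.countPerHour(WindowAssignment.sample, _.eventTimeMin)` puts two plays into the 10:00 window, while selecting `_.processingTimeMin` moves the late play into the 11:00 window, so the two views disagree about what happened in each hour.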
  21. 21. Problems FILES, BATCHES, DATA LAKES
  22. 22. Solving Streaming Problem With Batch?
  23. 23. 3rd - Streaming-First Architecture
  24. 24. User Session Windows (diagram: Case A and Case B event timelines; session gap e.g. 15 minutes; label: 5)
  25. 25. User Session Windows (diagram: Case A and Case B event timelines; session gap e.g. 15 minutes; labels: 5, [3,2])
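The idea of a session gap can be sketched in plain Scala, independent of any streaming engine. The `Sessionizer` object and the sample timestamps are hypothetical, and the boundary convention (a gap of exactly `gapMin` minutes still continues the session) is an assumption of this sketch rather than a statement about a particular engine.

```scala
object Sessionizer {
  // Split one user's event timestamps (in minutes) into sessions.
  // An event continues the current session if it is at most `gapMin` minutes
  // after the previous event; otherwise it opens a new session.
  def sessionize(eventTimesMin: Seq[Int], gapMin: Int): List[List[Int]] =
    eventTimesMin.sorted
      .foldLeft(List.empty[List[Int]]) {
        // `current` holds the open session in reverse order, so its head is the latest event.
        case (current :: done, t) if t - current.head <= gapMin => (t :: current) :: done
        case (sessions, t) => List(t) :: sessions
      }
      .map(_.reverse)
      .reverse
}
```

For example, `Sessionizer.sessionize(Seq(0, 5, 12, 40, 45), 15)` yields two sessions of sizes 3 and 2, because the 28-minute pause between minute 12 and minute 40 exceeds the gap, whereas `Seq(0, 5, 12, 20, 30)` stays a single session of 5 events. Sorting first means the result does not depend on arrival order, which is the property fixed hourly windows lack.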
  26. 26. Reading From Kafka val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...))
  27. 27. Session Windows With Gap val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) (diagram: events split by key into User 1 and User 2)
  28. 28. Session Windows With Gap val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) (diagram: User 1; session gap - 15 minutes)
  29. 29. Analyzing User Session val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .apply(new CountSessionStats()) (diagram: User 1)
  30. 30. Handling Late Events val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .allowedLateness(Time.minutes(60)) .apply(new CountSessionStats()) (diagram: User 1)
  31. 31. Triggering Early Results val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .trigger(EarlyTriggeringTrigger.every(Time.minutes(10))) .allowedLateness(Time.minutes(60)) .apply(new CountSessionStats()) (diagram: User 1)
  32. 32. Sessionization Example val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .trigger(EarlyTriggeringTrigger.every(Time.minutes(10))) .allowedLateness(Time.minutes(60)) .apply(new CountSessionStats()) Working example: https://github.com/getindata/flink-use-case
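The `.allowedLateness(Time.minutes(60))` step can be pictured with a deliberately simplified model in plain Scala. This is a sketch of the general idea only, not Flink's exact rule: the `Lateness` object, the three outcome names, and the minute-based timestamps are all assumptions introduced for illustration. An event belonging to a window is classified by comparing the current watermark with the window end and with the window end plus the allowed lateness.

```scala
object Lateness {
  sealed trait Decision
  case object OnTime extends Decision          // watermark has not passed the window end yet
  case object LateButAccepted extends Decision // window already fired, but its result can still be updated
  case object Dropped extends Decision         // too late: the window state is assumed to be discarded

  // Simplified model: all arguments are timestamps or durations in minutes.
  def classify(windowEndMin: Long, watermarkMin: Long, allowedLatenessMin: Long): Decision =
    if (watermarkMin < windowEndMin) OnTime
    else if (watermarkMin < windowEndMin + allowedLatenessMin) LateButAccepted
    else Dropped
}
```

With a window ending at minute 100 and 60 minutes of allowed lateness, an event arriving while the watermark is at 90 is on time, one arriving at watermark 130 still updates the session, and one arriving at watermark 160 is dropped. A larger allowance trades more retained state for more complete results.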
  33. 33. Modern Stream Processing Engines ■ Rich stream processing semantics ● Built-in support for event-time windows ● Accurate results for late / out-of-order events and replays ● Early triggers ■ Low latency and high throughput ■ Exactly-once stateful processing
  34. 34. Modern Stream Processing Engines ■ Rich stream processing semantics ● Built-in support for event-time windows ● Accurate results for late / out-of-order events and replays ● Early triggers ■ Low latency and high throughput ■ Exactly-once stateful processing User survey: http://data-artisans.com/flink-user-survey-2016-part-1 http://data-artisans.com/flink-user-survey-2016-part-2
  36. 36. How can I reprocess data?
  37. 37. Reprocessing Events In Flink 1. Take periodic snapshots of a job ● Each snapshot stores Kafka offsets, in-flight sessions, application state 2. Restart a job from a savepoint rather than from the beginning
  38. 38. What if data is no longer in Kafka?
  39. 39. Consuming Data From HDFS ■ Run your streaming code on HDFS (bounded data) ● You need to read data in event-time order ● Implement a mechanism for proper watermark generation
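One common way to generate watermarks is to trail the highest event time seen so far by a fixed bound. The sketch below shows that idea in plain Scala; the `Watermarks` object, the function name, and the sample values are hypothetical, and the talk does not say which strategy the authors used when replaying data from HDFS.

```scala
object Watermarks {
  // After each event, emit a watermark equal to the highest event time seen so far
  // minus the tolerated out-of-orderness. Because the running maximum never
  // decreases, the watermark never moves backwards.
  def boundedOutOfOrderness(eventTimes: Seq[Long], maxOutOfOrderness: Long): Seq[Long] =
    eventTimes
      .scanLeft(Long.MinValue)((maxSeen, t) => math.max(maxSeen, t))
      .tail // drop the initial Long.MinValue seed
      .map(_ - maxOutOfOrderness)
}
```

For the event times 10, 12, 11, 20, 15 and a bound of 5, the emitted watermarks are 5, 7, 7, 15, 15: the out-of-order events at 11 and 15 do not pull the watermark back. When historical files are read roughly in event-time order, a small bound is enough; the less ordered the input, the larger the bound has to be.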
  40. 40. What are the usual stream processing applications?
  41. 41. Stream Analytics Image source: https://www.slideshare.net/sinisalyh/storm-at-spotify
  42. 42. Stream 24/7 Applications
  43. 43. When is batch processing good?
  44. 44. Batch Processing Use-Cases ■ Ad-hoc analytics and data exploration ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets ■ Technical advantages ● Large swaths of historical data in HDFS ● High-level libraries in mature batch technologies
  45. 45. Batch Processing Use-Cases ■ Ad-hoc analytics and data exploration ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets ■ Implementation advantages ● Offline experiments over large historical data ■ Historical events are usually stored in HDFS, not Kafka ● High-level libraries in batch processing technologies ■ Spark MLlib, H2O ■ But when data arrives continuously, don’t solve a streaming problem with batch jobs
  46. 46. Who Are You, actually? ■ At GetInData, we build custom Big Data solutions ● Hadoop, Flink, Spark, Kafka and more ■ Our team is represented today by Krzysztof Zarzycki, Dawid Wysakowicz, Adam Kawa
  47. 47. Summary ■ A stream is often the natural representation of your data ■ Stream processing is not only about low latency
  48. 48. Q&A
  49. 49. Thanks!
  51. 51. Log Abstraction (diagram labels: 10:00 - 11:00, 11:00 - 12:00, 12:00 - 13:00, …; 10:00 - …)
  52. 52. Spark Structured Streaming ⬇ Operates on top of micro-batches (Spark SQL engine) ■ The ALPHA version and the experimental API until July 11, 2017 ⬆ Easy-to-learn API (Dataset/DataFrame) ⬆ Rich ecosystem of tools and libraries e.g. MLlib ⬆ Supports event-time ⬇ Sessionization not yet supported - SPARK-10816 ⬇ Queryable state not yet supported - SPARK-16738
  53. 53. Kafka Streams ⬇ No exactly-once (just at-least-once) ⬇ Kafka as the only data source ⬇ No bounded streams (batch) optimizations ⬆ Simplicity ⬆ Embedded into application ⬆ Supports event-time ⬇ Lack of session windows
  54. 54. Apache Beam ⬆ Unified API for batch and streaming ⬆ Rich stream processing semantics ⬆ Complex TriggerDSL ⬆ Multiple runtime environments ● Spark, Flink, Apex, Dataflow ⬆ Side inputs and outputs ⬇ Verbose Java API ⬇ New project - Top level since 01/2017
  55. 55. Google Dataflow ■ Runtime environment for Apache Beam in Google Cloud ⬇ No support for Iterative Computations ⬆ Supports Side Outputs ⬆ Works with every Google Cloud Service (Pub/Sub, BigTable etc.)
  56. 56. How to join with other data sets/streams?
  57. 57. Join With Other Datasets / Streams ■ Flink can join windowed streams easily ■ Join of data stream with data set is WIP ● Even with slowly changing data set! ● Even keyed data (diagram: Stream 1 + Stream 2 = Joined Stream; Input Stream + Dataset with columns Id, Name and rows 1 John Doe, 2 Jane Doe = Joined Stream)
  58. 58. I like this streaming API. Can I use it for batch?
  59. 59. Unified batch and streaming API ■ Not with raw Flink API ■ But with Flink Table API ■ Apache Beam
  60. 60. Is Flink production ready?
  61. 61. Powered By Flink
