Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Streaming analytics better than batch when and why - (Big Data Tech 2017)

1,526 views

Published on

While a lot of problems can be solved in batch, the stream processing approach currently gives you more benefits. And it’s not only sub-second latency at scale. But mainly possibility…

to express accurate analytics with little effort – something that is hard or usually ignored with older batch technologies like Pig, Scalding, Spark or even established stream processors like Storm or Spark Streaming. In this talk we’ll use a real-world example of user session analytics to give you a use-case driven overview of business and technical problems that modern stream processing technologies like Flink help you solve, and benefits you can get by using them today for processing your data as a stream.

Published in: Technology

Streaming analytics better than batch when and why - (Big Data Tech 2017)

  1. 1. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Streaming analytics better than batch - when and why ? _A. Kawa - D. Wysakowicz - K. Zarzycki_
  2. 2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Have you ever built cool Big Data pipelines?
  3. 3. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
  4. 4. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Example Use-Case ■ Can be done in batch and real-time ■ User session analytics at Spotify ● Simple stats ■ Duration, number of songs, skips, searches etc. ● Advanced analytics ■ Mood, physical activity, real-time content, ads
  5. 5. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Example Output How long do users listen to a new edition of Discover Weekly? _1. Dashboards_
  6. 6. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Example Output How long do users listen to a new edition of Discover Weekly? Australian users are listening to Discover Weekly too short !!! _1. Dashboards_ _2. Alerts_
  7. 7. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Example Output How long do users listen to a new edition of Discover Weekly? Australian users are listening to Discover Weekly too short !!! Recommend songs and ads based on current activity. _1. Dashboards_ _2. Alerts_ _3. Content_
  8. 8. © Copyright. All rights reserved. Not to be reproduced without prior written consent. 1st - Batch Architecture 1h 1h 1h 1h - 1d 1h User Events User Sessions
  9. 9. © Copyright. All rights reserved. Not to be reproduced without prior written consent. 1st - Batch Architecture 1h 1h 1h 1d 1h User Events User Sessions
  10. 10. © Copyright. All rights reserved. Not to be reproduced without prior written consent. The More Moving Parts … ⬇ The higher learning curve ⬇ The more gluing code ⬇ The larger administrative effort ⬇ The more error-prone solution
  11. 11. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Long Waiting Time Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009 and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
  12. 12. © Copyright. All rights reserved. Not to be reproduced without prior written consent. 2nd - Micro-Batch Architecture 1m - 1h
  13. 13. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ♪ ♪ No Built-In Session Windows ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ [10:00 - 11:00) [11:00 - 12:00)
  14. 14. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ♪ ♪ No Built-In Session Windows ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ [10:00 - 11:00) [11:00 - 12:00)
  15. 15. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Late Data … ♪ ♪ ♪ ♪ ♪ ♪ Event Time 14:55 - 16:35 Processing Time
  16. 16. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ... Included in Current Batch ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ 14:55 - 16:35 16:50 - … Event Time Processing Time
  17. 17. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Out-Of-Order Data … ♪ ♫ ♪ Event Time Processing Time
  18. 18. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Out-Of-Order Data … ♪ ♫ ♪ ♪ ♪ ♫ ♪ ♪ ♫ Event Time Processing Time
  19. 19. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Out-Of-Order Data … ♪ ♫ ♪ ♪ ♪ ♫ ♪ ♪ ♫ Event Time Processing Time
  20. 20. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ... Breaks Correctness ♪ ♫ ♪ ♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫ ♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫ ♪ Event Time Processing Time
  21. 21. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Problems FILES, BATCHES, DATA LAKES
  22. 22. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Solving Streaming Problem With Batch?
  23. 23. © Copyright. All rights reserved. Not to be reproduced without prior written consent. 3rd - Streaming-First Architecture
  24. 24. © Copyright. All rights reserved. Not to be reproduced without prior written consent. User Session Windows ♪User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ User 3 ♪ ♪ ♪ ♪ ♪ ♪ Session gap eg. 15 minutes ♪ ♪♪ 5
  25. 25. © Copyright. All rights reserved. Not to be reproduced without prior written consent. User Session Windows ♪User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ User 3 ♪ ♪ ♪ ♪ ♪ ♪ Session gap eg. 15 minutes ♪ ♪♪ 5 [3,2]
  26. 26. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Reading From Kafka val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪ ♪ ♪ ♪ ♪ ♪
  27. 27. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Session Windows With Gap val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ User 1 User 2
  28. 28. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Session Windows With Gap val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) User 1 ♪ ♪ ♪ ♪ ♪ ♪ Session gap - 15 minutes ♪♪
  29. 29. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Analyzing User Session val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .apply(new CountSessionStats()) User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪
  30. 30. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Handling Late Events val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .allowedLateness(Time.minutes(60)) .apply(new CountSessionStats()) User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪
  31. 31. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Triggering Early Results val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .trigger(EarlyTriggeringTrigger.every(Time.minutes(10))) .allowedLateness(Time.minutes(60)) .apply(new CountSessionStats()) User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪
  32. 32. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Sessionization Example val sessionStream : DataStream[SessionStats] = sEnv .addSource(new KafkaConsumer(...)) .keyBy(_.userId) .window(EventTimeSessionWindows.withGap(Time.minutes(15))) .trigger(EarlyTriggeringTrigger.every(Time.minutes(10))) .allowedLateness(Time.minutes(60)) .apply(new CountSessionStats()) Working example: https://github.com/getindata/flink-use-case
  33. 33. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Modern Stream Processing Engines ■ Rich stream processing semantic ● Built-in support for event-time windows ● Accurate results for late / out-of-order events and replays ● Early triggers ■ Low latency and high-throughput ■ Exactly-once stateful processing
  34. 34. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Modern Stream Processing Engines ■ Rich stream processing semantic ● Built-in support for event-time windows ● Accurate results for late / out-of-order events and replays ● Early triggers ■ Low latency and high-throughput ■ Exactly-once stateful processing User survey: http://data-artisans.com/flink-user-survey-2016-part-1 http://data-artisans.com/flink-user-survey-2016-part-2
  35. 35. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
  36. 36. © Copyright. All rights reserved. Not to be reproduced without prior written consent. How can I reprocess data?
  37. 37. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Reprocessing Events In Flink 1. Take periodic snapshots of a job ● It stores Kafka offsets, on-flight sessions, application state 2. Restart a job from a savepoint rather than from a beginning
  38. 38. © Copyright. All rights reserved. Not to be reproduced without prior written consent. What if data is no longer in Kafka?
  39. 39. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Consuming Data From HDFS ■ Run your streaming code on HDFS (bounded data) ● You need to read data in event-time based order
  40. 40. © Copyright. All rights reserved. Not to be reproduced without prior written consent. How to join with other data sets/streams?
  41. 41. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Join With Other Datasets / Streams ■ Flink can join windowed streams easily ■ Join of data stream with data set is WIP ● Even with slowly changing data set! ● Even keyed data Stream 2 Stream 1 Joined Stream Input Stream Joined Stream + Id Name 1 John Doe 2 Jane Doe Dataset +
  42. 42. © Copyright. All rights reserved. Not to be reproduced without prior written consent. When is batch processing good?
  43. 43. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Batch Processing Use-Cases ■ Ad-hoc analytics and data exploration ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets ■ Technical advantages ● A large swaths of historical data in HDFS ● High-level libraries in mature batch technologies
  44. 44. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Batch Processing Use-Cases ■ Ad-hoc analytics and data exploration ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets ■ Implementation advantages ● Offline experiments over large historical data ■ Historical events are usually stored in HDFS, not Kafka ● High-level libraries in batch processing technologies ■ Spark MLlib, H2O (when data arrives continuously) don’t solve streaming problem with batch jobs
  45. 45. © Copyright. All rights reserved. Not to be reproduced without prior written consent. I like this streaming API. Can I use it for batch?
  46. 46. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Unified batch and streaming API ■ Not with raw Flink API ■ But with Flink Table API ■ Apache Beam
  47. 47. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Who Are You, actually? ■ At GetInData, we build custom Big Data solutions ● Hadoop, Flink, Spark, Kafka and more ■ Our team is today represented by Krzysztof Zarzycki Dawid Wysakowicz Adam Kawa
  48. 48. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ■ Stream often the natural representation of your data ■ Stream processing is not only about low latency Summary
  49. 49. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Q&A
  50. 50. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Thanks ! Big Data Tech Warsaw !
  51. 51. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
  52. 52. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Log Abstraction 11:00 - 12:00 12:00 - 13:00 … … 10:00 - … 10:00 - … 10:00 - 11:00
  53. 53. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Spark Structured Streaming ⬇ It’s still ALPHA and the APIs are still experimental ⬇ Operates on top of micro-batches (Spark SQL engine) ⬆ Easy-to-learn API (Dataset/DataFrame) ⬆ Rich ecosystem of tools and libraries e.g. MLlib ⬆ Supports event-time ⬇ Sessionization not yet supported - SPARK-10816 ⬇ Queryable state not yet supported - SPARK-16738
  54. 54. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Kafka Streams ⬇ No exactly-once (just at-least-once) ⬇ Kafka as the only data source ⬇ No bounded streams (batch) optimizations ⬆ Simplicity ⬆ Embedded into application ⬆ Supports event-time ⬇ Lack of session windows
  55. 55. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Apache Beam ⬆ Unified API for batch and streaming ⬆ Rich streaming processing semantics ⬆ Complex TriggerDSL ⬆ Multiple runtime environments ⬆ Spark, Flink, Apex, Dataflow ⬆ Side inputs and outputs ⬇ Verbose Java API ⬇ New project - Top level since 01/2017
  56. 56. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Google Dataflow ■ Runtime environment for Apache Beam in Google Cloud ⬇ No support for Iterative Computations ⬆ Supports Side Outputs ⬆ Works with every Google Cloud Service (Pub/Sub, BigTable etc.)

×