
Streaming analytics state of the art


Published at the Big Data & Data Science International Conference 2017, 4-5 December 2017, Rome, Italy

Published in: Software

Streaming analytics state of the art

  1. @s_kontopoulos Streaming Analytics: State of the Art. Stavros Kontopoulos, Senior Software Engineer @ Lightbend, M.Sc.
  2. Who am I? Software Engineer @ Lightbend, Fast Data team; Apache Flink contributor. Find me at: skonto, s_kontopoulos; SlideShare: stavroskontopoulos
  3. Agenda: Streaming Analytics - What, Why & How; Streaming Platforms; Streaming Engines; Code Examples & Demo
  4. Insights. Data insight: a conclusion, a piece of information that can be used to take action and optimize a decision-making process. Customer insight: "a non-obvious understanding about your customers, which, if acted upon, has the potential to change their behaviour for mutual benefit" (Customer insight, Wikipedia). DATA → INFO → INSIGHTS → ACTIONS
  5. The Gap: DATA ... INSIGHTS
  6. Streaming Analytics - Bridging the Gap. Data input flow (sensors, mobile apps, etc.) → Collect → Analyze → data output flow (alarms, visualizations, ML scoring, etc.), with a permanent store attached to the data flow.
  7. Streaming Analytics. "Streaming analytics is the acquisition and analysis of data at the moment it streams into the system. It is a process done in a near real-time (NRT) fashion, and analysis results trigger specific actions for the system to execute."
     ● No constraints or deadlines in the way they exist in real-time (RT) systems
     ● End-to-end processing delay varies and depends on the application (< 1 ms to minutes)
  8. Big Data vs Fast Data
     ● Data in motion is the key characteristic.
     ● Fast Data is the new Big Data!
     Two categories of systems: batch vs streaming.
  9. Common Use Cases. Image: Lightbend Inc.
  10. Speed? Image: Lightbend Inc.
  11. Batch Data Pipeline: New Data → Analysis → Batch View. The traditional MapReduce paradigm. Image: Lightbend Inc.
  12. Streaming Data Pipeline: in-memory processing as data flows. New Data → Analysis → NR-Time View, on a streaming platform such as Apache Flink, Akka Streams, or Apache Kafka Streams.
  13. Streaming Platforms. A streaming platform is an ecosystem/environment that supports building and running streaming applications; at its core it uses a streaming engine. Example components:
     ● A durable pub/sub component to fetch or store data
     ● A streaming engine
     ● A registry for storing data metadata, such as the data format
  14. Streaming Platforms - Some Examples: Fast Data Platform, Confluent Enterprise, Da-Platform-2, Databricks Platform, IBM Streams, MapR Streams, Pravega, ...
  15. Streaming Engine - the Core. A streaming engine provides the basic capabilities for developing and deploying streaming applications. Some systems, such as Kafka Streams or Akka Streams, are just libraries and do not cover deployment effectively.
  16. Streaming Engine - Key Features I
     ● Fault tolerance
     ● Processing guarantees
     ● Checkpointing
     ● Streaming SQL
     ● Batch + streaming API
     ● Language integration (Python, Java, Scala, R)
     ● State management, user session state
     ● Locality awareness
     ● Backpressure
  17. Streaming Engine - Key Features II
     ● Multi-scheduler support: YARN, Mesos, Kubernetes
     ● Micro-batching vs dataflow execution
     ● ML, graph processing, CEP
     ● Connectors (sources, sinks)
     ● Memory/disk management (shuffling)
     ● Security (Kerberos etc.)
  18. DataFlow Execution Model. The user defines computations/operations (map, flatMap, etc.) on data sets (bounded or not) as a DAG. The data sets are treated as immutable distributed data. The DAG is shipped to the nodes where the data lies, the computation is executed there, and results are sent back to the user. (Spark model example; Flink model - FLIP-6)
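The dataflow model above can be mimicked in a few lines: a tiny DAG of deferred operators over an immutable data set, defined first and executed later. This is a framework-free Python sketch for illustration only; the `Dataset` class and its operator set are assumptions, not any engine's API:

```python
class Dataset:
    """A tiny immutable 'data set' carrying a DAG of deferred operators:
    define the computation first, execute it only on collect()."""
    def __init__(self, data, ops=()):
        self._data = tuple(data)   # immutable input partition
        self._ops = tuple(ops)     # the DAG, as a chain of operators

    def map(self, f):
        # Returns a new Dataset; the original is never mutated.
        return Dataset(self._data, self._ops + (("map", f),))

    def flat_map(self, f):
        return Dataset(self._data, self._ops + (("flat_map", f),))

    def collect(self):
        # "Ship" the DAG to the data and execute it there.
        out = list(self._data)
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [y for x in out for y in f(x)]
        return out

ds = Dataset([1, 2, 3]).map(lambda x: x * 10).flat_map(lambda x: [x, x + 1])
print(ds.collect())  # [10, 11, 20, 21, 30, 31]
```

Real engines distribute the partitions and the DAG across nodes; the lazy define-then-execute shape is the part this sketch demonstrates.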
  19. Streaming Engine - Which one to choose? Some engines to consider... Image: Lightbend Inc.
  20. The Modern Enterprise Fast Data Architecture. Layers: Infrastructure (on premise, cloud); cluster scheduler (YARN, Standalone, Kubernetes, Mesos); streaming platform (pub/sub, streaming engine, etc.); permanent storage (HDFS, S3, ...); Fast Data apps and microservices; ML; operations (monitoring, security, governance); BI; data lake.
  21. Example Fast Data Architecture for the Enterprise. Image: Lightbend Inc.
  22. Analyzing Data Streams. Processing infinite data streams imposes certain restrictions compared to batch processing:
     - We may need to trade off accuracy against space and time costs, e.g. by using approximate algorithms or sketches (such as count-min) to summarize stream data.
     - Streaming jobs are required to operate 24/7 and need to adapt to code changes, failures, and load variance.
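The count-min sketch mentioned above can be illustrated with a minimal, framework-free Python sketch. The hash construction, width, and depth here are illustrative assumptions, not tuned values:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counting in sub-linear space.
    Estimates are never below the true count; hash collisions
    can only inflate them."""
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for i in range(self.depth):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum across rows is the tightest upper bound on the true count.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

cms = CountMinSketch()
for sensor in ["sensor1", "sensor2", "sensor1", "sensor1"]:
    cms.add(sensor)
print(cms.estimate("sensor1"))  # >= 3 (exactly 3 unless collisions occur)
```

The space cost is fixed (width × depth counters) regardless of how many distinct items stream through, which is exactly the accuracy-for-space trade-off the slide refers to.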
  23. Analyzing Data Streams
     ● Data flows from one or more sources through the engine and is written to one or more sinks.
     ● Two cases of processing:
       ○ Single-event processing: event transformation, e.g. triggering an alarm on an error event.
       ○ Event aggregations: summary statistics, group-by, join, and similar queries; for example, computing the average temperature over the last 5 minutes from a sensor data stream.
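The aggregation example above (average temperature over 5-minute windows) can be sketched without any streaming framework. The event format (timestamp in ms, temperature) and the tumbling-window size are assumptions for illustration:

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(ts_ms):
    # Map an event timestamp to the start of its tumbling window.
    return ts_ms - (ts_ms % WINDOW_MS)

def average_per_window(events):
    """events: iterable of (timestamp_ms, temperature) pairs.
    Returns {window_start: average temperature}."""
    sums = defaultdict(lambda: [0.0, 0])  # window -> [sum, count]
    for ts, temp in events:
        acc = sums[window_start(ts)]
        acc[0] += temp
        acc[1] += 1
    return {w: s / n for w, (s, n) in sums.items()}

events = [(0, 20.0), (60_000, 22.0), (310_000, 30.0)]
print(average_per_window(events))  # {0: 21.0, 300000: 30.0}
```

A real engine evaluates such aggregates incrementally as events arrive, rather than over a materialized list, but the per-window accumulator is the same idea.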
  24. Analyzing Data Streams
     ● Event aggregation introduces the concept of windowing with respect to the notion of time selected:
       ○ Event time (the time at which events happen): important for most use cases where context and correctness matter. Examples: billing applications, anomaly detection.
       ○ Processing time (the time at which events are observed during processing): use cases where we only care about what is processed within a window. Example: accumulated clicks on a page per second.
       ○ Ingestion (system arrival) time: the time at which events arrive at the streaming system.
     ● Ideally, event time = processing time; in reality there is skew.
  25. Analyzing Data Streams
     ● Windows come in different flavors:
       ○ Tumbling windows discretize a stream into non-overlapping windows.
       ○ Sliding windows slide over the stream of data and may overlap.
     Images: 4/Introducing-windows.html
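The difference between the two flavors can be made concrete by computing which windows a single event timestamp falls into. Window size and slide here are illustrative assumptions:

```python
def tumbling_window(ts, size):
    # Exactly one window per event: windows are contiguous and non-overlapping.
    start = ts - (ts % size)
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    # Several overlapping windows may contain the same event (size >= slide).
    windows = []
    start = ts - (ts % slide)  # latest window starting at or before ts
    while start > ts - size:
        if start >= 0:
            windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# An event at t=7 with size 10:
print(tumbling_window(7, 10))     # [(0, 10)] - one window
print(sliding_windows(7, 10, 5))  # [(0, 10), (5, 15)] - two overlapping windows
```

A tumbling window is just the special case of a sliding window where the slide equals the size.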
  26. Analyzing Data Streams
     ● Watermarks: a watermark indicates that no more elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data; it marks the progress of event time.
     ● Triggers: decide when a window is evaluated or purged; they affect latency and the amount of state kept.
     ● Late data: a threshold specifies how late data may be relative to the current watermark value.
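A minimal sketch of watermark tracking and late-data detection, using a bounded-out-of-orderness heuristic (the 2-second bound and the tracker shape are assumptions for illustration, not any engine's API):

```python
MAX_OUT_OF_ORDERNESS_MS = 2_000  # assumed bound on event-time skew

class WatermarkTracker:
    """Derives a watermark as (max observed event time - bound):
    event time is assumed complete up to the watermark."""
    def __init__(self):
        self.max_ts = 0

    def observe(self, event_ts):
        self.max_ts = max(self.max_ts, event_ts)

    @property
    def watermark(self):
        return self.max_ts - MAX_OUT_OF_ORDERNESS_MS

    def is_late(self, event_ts):
        # Older than or equal to the watermark: the window may already be closed.
        return event_ts <= self.watermark

wm = WatermarkTracker()
for ts in [1_000, 5_000, 9_000]:
    wm.observe(ts)
print(wm.watermark)       # 7000
print(wm.is_late(6_500))  # True
print(wm.is_late(8_000))  # False
```

Whether a late event is dropped, sent to a side output, or merged into a re-fired window is exactly what the trigger and allowed-lateness settings on the previous slide control.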
  27. Analyzing Data Streams. Recent advances in streaming (such as the concept of watermarks) are the result of pioneering work:
     ○ MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
     ○ The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803.
     ● The world beyond batch: Streaming 101 (Tyler Akidau)
     ● The world beyond batch: Streaming 102 (Tyler Akidau)
  28. Analyzing Data Streams
     ● Apache Beam is the open-source successor of Google's Dataflow model.
     ● It provides the advanced semantics needed by current streaming applications.
     ● Google Dataflow, Apache Flink, and Apache Spark follow that model (see Beam's capability matrix).
  29. Streams meet the distributed log - I. Streams fit naturally with the idea of the distributed log (e.g. Kafka Streams is integrated with Kafka, and Dell/EMC's Pravega* uses the stream as a storage primitive on top of Apache BookKeeper). *Pravega is an open-source streaming storage system.
  30. Streams meet the distributed log - II. Possible distributed-log use cases:
     ● Implementing external services (microservices)
     ● Implementing internal operations (e.g. Kafka Streams shuffling, fault tolerance)
  31. Processing Guarantees. Many things can go wrong...
     ● At-most-once
     ● At-least-once
     ● Exactly-once
     What are the boundaries? Within the streaming engine only? End-to-end, including sources and sinks? What about side effects, such as calling an external service?
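One common way to turn at-least-once delivery into effectively-once results at the boundary is an idempotent sink that deduplicates by event ID. This framework-free sketch (the event IDs and sink shape are illustrative assumptions) shows why redelivery after a failure then does no harm:

```python
class IdempotentSink:
    """Deduplicates redelivered events by ID, so at-least-once
    delivery yields effectively-once results in the sink."""
    def __init__(self):
        self.seen = set()
        self.total = 0

    def write(self, event_id, value):
        if event_id in self.seen:
            return  # duplicate redelivery after a replay: ignore it
        self.seen.add(event_id)
        self.total += value

sink = IdempotentSink()
# "e2" is delivered twice, as can happen after a failure and replay.
for eid, value in [("e1", 10), ("e2", 5), ("e2", 5), ("e3", 1)]:
    sink.write(eid, value)
print(sink.total)  # 16, not 21
```

Note this only covers the sink boundary: side effects such as calling an external service (the question on the slide) need their own idempotence or transactional handling.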
  32. Table-Stream Duality
     ● Stream → table: aggregating a stream of updates over time yields a table.
     ● Table → stream: observing the changes to a table over time yields a stream.
     Why is this useful?
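The duality can be shown in a few lines: folding a changelog stream into a dict gives the table, and diffing two table states gives back a changelog. The key/value shapes are illustrative assumptions:

```python
def stream_to_table(changelog):
    """Aggregate a stream of (key, value) updates into a table:
    the latest value per key wins."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

def table_to_stream(old, new):
    """Observe the changes between two table states as a changelog stream."""
    return [(k, v) for k, v in new.items() if old.get(k) != v]

updates = [("alice", 1), ("bob", 2), ("alice", 3)]
table = stream_to_table(updates)
print(table)  # {'alice': 3, 'bob': 2}
print(table_to_stream({"alice": 1, "bob": 2}, table))  # [('alice', 3)]
```

This is why it is useful: a table can always be rebuilt by replaying its changelog, which is how systems like Kafka Streams restore state after a failure.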
  33. Streaming SQL. What are the query semantics? How do we define a join on an unbounded stream? A table join? There is joint work on this in the Apache Flink community.
  34. Streaming Applications - Spark Structured Streaming API. Slide shows code that creates a Spark session and reads from a Kafka topic.
  35. Streaming Applications - Spark Structured Streaming API. Slide shows code that joins in sensor metadata, emits complete output to the console on every event-time window update, and sets up a trigger.
  36. Streaming Applications - Flink Streaming API. Slide shows code for a custom source with initial sensor values.
  37. Streaming Applications - Flink Streaming API. Slide shows code for watermark generation and for creating some random data.
  38. Streaming Applications - Flink Streaming API. Slide shows code that creates a windowed keyed stream and applies a function per window.
  39. Kafka Streams vs Beam Model
     - A trigger is more of an operational aspect compared to business parameters such as the window length: how often the computation is updated (affecting latency and state size) is a non-functional requirement.
     - A table covers both the case of immutable data and the case of updatable data.
  40. Kafka Streams vs Beam Model

      KTable<Windowed<String>, Long> aggregated = inputStream
          .groupByKey()
          .reduce((aggValue, newValue) -> aggValue + newValue,
              TimeWindows.of(TimeUnit.MINUTES.toMillis(2))
                  .until(TimeUnit.DAYS.toMillis(1) /* keep for one day */),
              "queryStoreName");

      props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100 /* milliseconds */);
      props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
  41. Source code:
  42. Thank you! Questions?