@s_kontopoulos
Streaming Analytics: State of The Art
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Who am I?
skonto
s_kontopoulos
Senior Software Engineer @ Lightbend, Fast Data Team
Contributor at Apache Flink
SlideShare: stavroskontopoulos
stavroskontopoulos
Agenda
- Streaming Analytics - What & Why & How
- Streaming Platforms - Streaming Engines
- Code examples & Demo
Insights
Data Insight: a conclusion, a piece of information that can be used to take
action and optimize a decision-making process.
Customer Insight: A non-obvious understanding about your customers, which if
acted upon, has the potential to change their behaviour for mutual benefit
Customer insight, Wikipedia
DATA → INFO → INSIGHTS → ACTIONS
The Gap
DATA → ? → INSIGHTS
Streaming Analytics - Bridging the Gap
DATA FLOW: Data Input Flow (sensors, mobile apps, etc) → Collect → Analyze →
Data Output Flow (alarms, visualizations, ML scoring, etc)
Permanent Store
Streaming Analytics
“Streaming Analytics is the acquisition and analysis of data at the moment it
streams into the system. It is a process done in a near-real-time (NRT) fashion, and
analysis results trigger specific actions for the system to execute.”
● No hard constraints or deadlines of the kind found in real-time (RT) systems
● Processing delay (end-to-end) varies and depends on the application (< 1 ms
to minutes)
Big Data vs Fast Data
● Data in motion is the key characteristic.
● Fast Data is the new Big Data!
Two categories of systems: batch vs
streaming systems.
Common Use Cases
Image: Lightbend Inc.
Speed?
Image: Lightbend Inc.
Batch Data Pipeline
New Data → Analysis → Batch View
Traditional MapReduce paradigm
Image: Lightbend Inc.
Streaming Data Pipeline
In-memory processing as data flows...
New Data → Analysis → NR-Time View
Streaming Platform: Apache Flink, Akka Streams, Kafka Streams
Streaming Platforms
It's an ecosystem/environment that supports building and running streaming
applications. At its core it uses a streaming engine. Examples of tools:
● A durable pub/sub component to fetch or store data
● A streaming engine
● A registry for storing metadata about the data, such as the data format
Streaming Platforms - Some Examples
- Fast Data Platform (https://www.lightbend.com/products/fast-data-platform)
- Confluent Enterprise (https://www.confluent.io/product/confluent-platform)
- Da-Platform-2 (https://data-artisans.com/da-platform-2)
- Databricks Platform (https://databricks.com/product/unified-analytics-platform)
- IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/)
- MapR Streams (https://mapr.com/products/mapr-streams/)
- Pravega (http://pravega.io)
...
Streaming Engine - the Core
A streaming engine provides the basic capabilities for developing and deploying
streaming applications.
Some systems, such as Kafka Streams or Akka Streams, are just libraries and
don't cover deployment effectively.
Streaming Engine - Key Features I
● Fault Tolerance
● Processing Guarantees
● Checkpointing
● Streaming SQL
● Unified Batch/Streaming API
● Language Integration (Python, Java, Scala, R)
● State Management, user session state
● Locality Awareness
● Backpressure
Streaming Engine - Key Features II
● Multi-Scheduler Support: Yarn, Mesos, Kubernetes
● Micro-batching vs dataflow execution
● ML, Graph, CEP libraries
● Connectors (sources, sinks)
● Memory/disk management (shuffling)
● Security (Kerberos etc.)
DataFlow Execution Model
The user defines computations/operations (map, flatMap, etc.) on data sets
(bounded or not) as a DAG. The data sets are treated as immutable
distributed data. The DAG is shipped to the nodes where the data lie, the
computation is executed, and the results are sent back to the user.
Spark Model
example
Flink model - FLIP 6
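The lazy, DAG-based execution can be sketched in plain Scala (illustrative only, not the Spark or Flink API; the operator names are just standard collection combinators):

```scala
// Illustrative only: a dataflow pipeline is a DAG of operators over an
// immutable data set. Here each operator is a lazy Iterator transformation,
// so nothing executes until a "sink" (toList) forces the computation.
object DataflowSketch {
  def pipeline(source: Iterator[Int]): Iterator[Int] =
    source
      .map(_ * 2)               // operator 1: map
      .flatMap(x => Seq(x, x))  // operator 2: flatMap (duplicates each element)
      .filter(_ % 4 == 0)       // operator 3: filter

  def main(args: Array[String]): Unit =
    println(pipeline(Iterator(1, 2, 3)).toList) // sink: triggers execution, prints List(4, 4)
}
```

Real engines do the same thing at cluster scale: the DAG is built on the driver/client and only runs when a sink or action is reached.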
Streaming Engine - Which one to choose?
Some engines to consider...
Image: Lightbend Inc.
The Modern Enterprise Fast Data Architecture
Infrastructure (on premise, cloud)
Cluster scheduler (Yarn, Standalone, Kubernetes, Mesos)
Streaming Platform (pub/sub, streaming engine, etc)
Fast Data Apps, Microservices, ML
Operations: Monitoring, Security, Governance
Permanent Storage (HDFS, S3...)
BI, Data Lake
Example Fast Data Architecture for the Enterprise
Image: Lightbend Inc.
Analyzing Data Streams
Processing infinite data streams imposes certain restrictions compared to batch
processing:
- We may need to trade off accuracy against space and time costs, e.g. use
approximate algorithms or sketches such as Count-Min for summarizing stream
data.
- Streaming jobs need to operate 24/7 and must be able to adapt to code
changes, failures, and load variance.
Analyzing Data Streams
● Data flows from one or more sources through the engine and is written to one
or more sinks.
● Two cases for processing:
○ Single event processing: event transformation, trigger an alarm on an error event
○ Event aggregations: summary statistics, group-by, join, and similar queries. For example,
compute the average temperature over the last 5 minutes from a sensor data stream.
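The two styles can be sketched with plain Scala collections (the `Reading` type and function names are illustrative, not from the demo code):

```scala
// Illustrative sketch of the two processing styles on sensor readings.
case class Reading(sensorId: String, tempCelsius: Double, isError: Boolean)

object ProcessingStyles {
  // Single-event processing: each event is handled on its own,
  // e.g. raise an alarm for every error event.
  def alarms(events: Seq[Reading]): Seq[String] =
    events.filter(_.isError).map(r => s"ALARM: sensor ${r.sensorId}")

  // Event aggregation: a summary over a group of events,
  // e.g. the average temperature per sensor (in a real engine, per window).
  def avgPerSensor(events: Seq[Reading]): Map[String, Double] =
    events.groupBy(_.sensorId).map { case (id, rs) =>
      id -> rs.map(_.tempCelsius).sum / rs.size
    }
}
```

Single-event processing is stateless per event; aggregation is what forces the engine to keep state and, on unbounded streams, to introduce windows.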
Analyzing Data Streams
● Event aggregation introduces the concept of windowing with respect to the notion of
time selected:
○ Event time (the time that events happen): Important for most use cases where context and
correctness matter at the same time. Example: billing applications, anomaly detection.
○ Processing time (the time they are observed during processing): Use cases where I only care
about what I process in a window. Example: accumulated clicks on a page per second.
○ System Arrival or Ingestion time (the time that events arrived at the streaming system).
● Ideally event time = processing time. Reality is: there is skew.
Analyzing Data Streams
● Windows come in different flavors:
○ Tumbling windows discretize a stream into non-overlapping windows.
○ Sliding Windows: slide over the stream of data.
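Window assignment itself is simple arithmetic over an event timestamp; a minimal sketch (illustrative, mirroring but not using any engine's API):

```scala
// Illustrative window assignment for a timestamp in milliseconds.
object WindowAssign {
  // Tumbling: non-overlapping; each timestamp falls in exactly one window,
  // identified here by its start time.
  def tumblingStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // Sliding: windows of length sizeMs starting every slideMs; a timestamp
  // can fall into several overlapping windows.
  def slidingStarts(ts: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
    val last = ts - (ts % slideMs) // latest window start at or before ts
    (last to (last - sizeMs + slideMs) by -slideMs)
      .filter(start => start >= 0 && ts < start + sizeMs)
      .reverse
  }
}
```

For example, with a 10 s window sliding every 5 s, an event at t = 12 s belongs to the windows starting at 5 s and 10 s; with tumbling 10 s windows it belongs only to the window starting at 10 s.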
Images: https://flink.apache.org/news/2015/12/04/Introducing-windows.html
Analyzing Data Streams
● Watermarks: a watermark indicates that no elements with a timestamp older than or
equal to the watermark timestamp should arrive for the specific window of data. It
marks the progress of event time.
● Triggers: decide when the window is evaluated or purged. They affect latency and the
state kept.
● Late data: a threshold for how late data may be relative to the current
watermark value.
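A common strategy is the bounded-out-of-orderness watermark, sketched below in plain Scala (illustrative; the names are not from any engine's API):

```scala
// Illustrative bounded-out-of-orderness watermark: the watermark trails the
// maximum event time seen so far by a fixed delay; an event at or behind the
// current watermark is considered late.
object WatermarkSketch {
  final case class State(maxEventTime: Long) {
    def watermark(maxDelayMs: Long): Long =
      if (maxEventTime == Long.MinValue) Long.MinValue // no events seen yet
      else maxEventTime - maxDelayMs
  }

  // Returns (new state, whether the event is late with respect to the
  // watermark computed before observing it).
  def observe(s: State, eventTime: Long, maxDelayMs: Long): (State, Boolean) = {
    val late = eventTime <= s.watermark(maxDelayMs)
    (State(math.max(s.maxEventTime, eventTime)), late)
  }
}
```

The delay is exactly the accuracy/latency trade-off: a larger delay admits more out-of-order data but holds window results (and state) longer.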
Analyzing Data Streams
● Recent advances in streaming (such as the concept of watermarks) are a
result of pioneering work:
○ MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB
2013.
○ The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803
● The World Beyond Batch: Streaming 101 (Tyler Akidau)
● The World Beyond Batch: Streaming 102 (Tyler Akidau)
Analyzing Data Streams
● Apache Beam is the open-source successor of Google’s
Dataflow.
● It provides the advanced semantics needed by modern
streaming applications.
● Google Cloud Dataflow, Apache Flink, and Apache Spark follow
that model
(https://beam.apache.org/documentation/runners/capability-matrix).
Streams meet distributed log - I
Streams fit naturally with the idea of the distributed log (e.g. Kafka Streams is
integrated with Kafka, and Dell/EMC’s Pravega* uses the stream as a storage
primitive on top of Apache BookKeeper).
*Pravega is an open-source streaming storage system.
Streams meet distributed log - II
Distributed log possible use cases:
● Implement external services (microservices)
● Implement internal operations (e.g. Kafka Streams shuffling, fault tolerance)
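The primitive itself is tiny; a toy in-memory version (illustrative only — real logs are partitioned, replicated, and durable):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative in-memory append-only log, the primitive behind Kafka-style
// systems: producers append records, consumers replay from any offset.
final class AppendLog[A] {
  private val entries = ArrayBuffer.empty[A]

  // Append a record and return its offset.
  def append(a: A): Long = { entries += a; (entries.size - 1).toLong }

  // Replay everything from the given offset onward.
  def readFrom(offset: Long): Seq[A] = entries.drop(offset.toInt).toSeq

  def endOffset: Long = entries.size.toLong
}
```

Because consumers track their own offsets, the same log serves both external services (each reading at its own pace) and internal engine mechanics such as shuffles and changelog-based fault tolerance.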
Processing Guarantees
Many things can go wrong…
● At-most-once
● At-least-once
● Exactly-once
What are the boundaries?
Within the streaming engine?
How about end-to-end including sources and sinks?
How about side effects like calling an external service?
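One classic answer for the sink boundary: combine at-least-once delivery with an idempotent sink. A toy sketch (illustrative names, not a real connector):

```scala
import scala.collection.mutable

// Illustrative idempotent sink: under at-least-once delivery the same event
// may arrive twice; deduplicating on a unique event id at the sink yields an
// effectively-exactly-once result.
final class DedupSink[A] {
  private val seen = mutable.Set.empty[String]
  private val out  = mutable.ArrayBuffer.empty[A]

  // Returns true if the write was applied, false if it was a duplicate.
  def write(id: String, value: A): Boolean =
    if (seen.add(id)) { out += value; true }
    else false

  def values: Seq[A] = out.toSeq
}
```

Note that this only covers the sink: side effects such as calling an external service still execute once per delivery unless that service is itself idempotent.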
Table Stream Duality
Stream → table: the aggregation of a stream of updates over time yields a
table.
Table → stream: the observation of changes to a table over time yields a
stream.
Why is this useful?
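Both directions can be written down directly (illustrative sketch; a table here is just a key-value map):

```scala
// Illustrative sketch of the duality: folding a changelog stream yields a
// table; diffing two table versions yields a changelog stream back.
object TableStreamDuality {
  type Table = Map[String, Int]
  final case class Update(key: String, value: Int)

  // Stream -> table: later updates for a key overwrite earlier ones.
  def toTable(updates: Seq[Update]): Table =
    updates.foldLeft(Map.empty: Table)((t, u) => t + (u.key -> u.value))

  // Table -> stream: emit an update for every key whose value changed.
  def toStream(before: Table, after: Table): Seq[Update] =
    after.collect {
      case (k, v) if !before.get(k).contains(v) => Update(k, v)
    }.toSeq
}
```

This is why a table can be rebuilt from a changelog after a failure, and why a stream of updates can back a queryable materialized view.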
Streaming SQL Queries
Semantics? How do we define a join on an unbounded stream? A table join?
There is joint work from:
https://docs.google.com/document/d/1wrla8mF_mmq-NW9sdJHYVgMyZsgCmHumJJ5f5WUzTiM/
Apache Flink
Streaming Applications - Spark Structured Streaming API
create a Spark session and read from a Kafka topic
Streaming Applications - Spark Structured Streaming API
sensor metadata
emit complete output for every event-time window update to the console; set up a trigger
Streaming Applications - Flink Streaming API
custom source
initial sensor values
Streaming Applications - Flink Streaming API
watermark generation
create some random data
Streaming Applications - Flink Streaming API
create a windowed keyed stream
apply a function per window
Kafka Streams vs Beam Model
- Trigger is more of an operational aspect compared to business parameters
like the window length. How often do I update my computation (affecting
latency and state size) is a non-functional requirement.
- A Table covers both the case of immutable data and the case of updatable
data.
Kafka Streams vs Beam Model
// Windowed aggregation: sum values per key over 2-minute windows,
// keeping the windowed state queryable for one day.
KTable<Windowed<String>, Long> aggregated = inputStream
    .groupByKey()
    .reduce((aggValue, newValue) -> aggValue + newValue,
        TimeWindows.of(TimeUnit.MINUTES.toMillis(2))
            .until(TimeUnit.DAYS.toMillis(1) /* keep for one day */),
        "queryStoreName");

// The commit interval and record cache control how often updates are emitted
// downstream (a trigger-like, operational knob).
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100 /* milliseconds */);
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model
Source code: http://bit.ly/2yhDCeN
Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/streaming-analytics/references.md
