More and more data sources today provide a constant stream of data, from Internet of Things devices to social media feeds. It is one thing to collect these events at the velocity they arrive, without losing a single message; an event hub and a data flow engine can help here. It is another thing to do some (complex) analytics on the data. There is always the option to first store the events in a data sink of choice, such as a data lake implemented on HDFS or an object store, or, if the event volume is not too high, in a NoSQL database or even an RDBMS. Storing a high-volume event stream is feasible and no longer much of a challenge. But doing so adds to the end-to-end latency, and it takes minutes or hours until you can present the results of your analytics. If you need to react fast, you simply can't afford to first store the data and do the analysis later. You have to be able to run part of your analytics directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I present the important concepts a Stream Processing solution should support, and then dive into some of the most popular frameworks on the market and how they compare.
2. gschmutz
Agenda
Stream Processing – Concepts and Frameworks
1. Motivation for Stream Processing?
2. Capabilities for Stream Processing
3. Implementing Stream Processing Solutions
4. Summary
3. gschmutz
Guido Schmutz
Working at Trivadis for more than 22 years
Oracle Groundbreaker Ambassador & Oracle ACE Director
Consultant, Trainer, Software Architect for Java, AWS, Azure,
Oracle Cloud, SOA and Big Data / Fast Data
Platform Architect & Head of Trivadis Architecture Board
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
155th edition
5. gschmutz
Challenges of Traditional BI Infrastructures
[Diagram: bulk sources (DB, File) are extracted via ETL / stored procedures into an Enterprise Data Warehouse, which feeds BI tools for segmentation & churn analysis and marketing offers.]
6. gschmutz
Challenges of Traditional BI Infrastructures
[Same diagram as on the previous slide, annotated with its pain points:]
• limited processing power
• storage scaling is very expensive
• analysis based on sample / limited data
• loss in fidelity
• schema-on-write
7. gschmutz
Big Data solves Volume and Variety – not Velocity
[Diagram: bulk sources (DB, File) are loaded via file import / SQL import into a Big Data platform (Hadoop cluster) with raw and refined storage and parallel processing; results are exported via SQL into the Enterprise Data Warehouse and served to BI tools, search/explore and enterprise apps. End-to-end latency is high.]
8. gschmutz
Big Data solves Volume and Variety – not Velocity
[Same diagram, extended with event sources: IoT data, telemetry, location, mobile apps, social media. Each produces an event stream that the batch-oriented platform cannot handle at its native velocity.]
9. gschmutz
Big Data solves Volume and Variety – not Velocity
[Same diagram again: the event streams are now collected by Event Hubs, but all downstream processing is still batch-oriented and high latency.]
10. gschmutz
"Data at Rest" vs. "Data in Motion"
[Diagram: Data at Rest means store first, then visualize/analyze, then (re)act. Data in Motion means analyze and act on events as they arrive, storing them in parallel.]
11. gschmutz
Stream Processing Architecture solves Velocity
[Diagram: event sources (IoT data, telemetry, location, mobile apps, social media) and bulk sources feed Event Hubs; a Stream Analytics platform consumes the event streams, combines them with reference data / models, and drives dashboards, services, enterprise apps, the Enterprise Data Warehouse and search/explore. Low(est) latency, but no history.]
12. gschmutz
Big Data for all historical data analysis
[Diagram: the stream processing architecture from the previous slide, with the Event Hub additionally feeding the Big Data platform through a data flow, so the complete history stays available for batch analytics; bulk sources are still loaded via file import / SQL import.]
13. gschmutz
Integrate existing systems with lower latency through CDC
[Diagram: as before, but existing databases now publish their changes to the Event Hub via Change Data Capture (CDC) instead of periodic bulk extracts, lowering the integration latency.]
14. gschmutz
New systems participate in event-oriented fashion
[Diagram: microservices and stream processors, each with their own state and API, produce and consume event streams through the Event Hub, side by side with the Big Data platform, Enterprise Data Warehouse, BI tools and enterprise apps.]
15. gschmutz
Edge computing allows processing close to data sources
[Diagram: an Edge Node with its own event hub, rules and storage pre-processes events close to the data sources (IoT data, telemetry) and forwards event streams and data flows to the central platform.]
16. gschmutz
Unified Architecture for Modern Data Analytics Solutions
[Diagram: the full architecture in one picture: edge nodes, event hub, data flows, change data capture, Big Data platform, stream analytics, microservices, Enterprise Data Warehouse, BI tools, search/explore and enterprise apps.]
17. gschmutz
Two Types of Stream Processing (by Gartner)
Stream Data Integration
• focuses on the ingestion and processing of data sources
• real-time extract-transform-load (ETL) and data integration use cases
• filter and enrich the data
Stream Analytics
• targets analytics use cases
• calculating aggregates and detecting patterns to generate higher-level, more relevant summary information
• events may signify threats or opportunities that require a response from the business
Gartner: Market Guide for Event Stream Processing, Nick Heudecker, W. Roy Schulte
19. gschmutz
Capabilities: Stream Data Integration vs. Stream Analytics

Capability | Stream Data Integration | Stream Analytics
Support for Various Data Sources | yes | -
Streaming ETL (Transformation / Format Translation, Routing, Validation) | yes | partial
Execution Mode: Native Streaming | yes | yes
Execution Mode: Non-Native Streaming (Micro-Batching) | yes | partial
Delivery Guarantees | yes | yes
API: GUI-Based / Declarative / Programmatic | yes | yes
API: Streaming SQL | - | yes
Event Time vs. Ingestion / Processing Time | - | yes
Windowing | - | yes
Stream-to-Static Joins (Lookup / Enrichment) | partial | yes
Stream-to-Stream Joins | - | yes
State Management | - | yes
Queryable State (aka Interactive Queries) | - | yes
Event Pattern Detection | - | yes
20. gschmutz
Integrating Data Sources
• Sensor Stream
• SQL Polling
• Change Data Capture (CDC)
• File Polling
• File Stream (File Tailing)
• File Stream (Appender)
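As an illustration, the "File Stream (File Tailing)" source above boils down to remembering an offset and reading whatever was appended since the last poll. This Python sketch uses an in-memory buffer instead of a real file and deliberately ignores file rotation and partial lines, which a production connector would have to handle:

```python
import io

def tail_new_lines(f, offset):
    """Read any lines appended to file object f since the given offset.

    Returns (events, new_offset). Assumes only complete lines are
    present; a real tailer also copes with rotation and half-written
    lines.
    """
    f.seek(offset)
    data = f.read()
    events = [line for line in data.splitlines() if line]
    return events, offset + len(data)

# Simulate a log file that grows between polls.
log = io.StringIO()
log.write("event-1\nevent-2\n")
events, pos = tail_new_lines(log, 0)   # first poll picks up two events
log.write("event-3\n")
more, pos = tail_new_lines(log, pos)   # next poll sees only the new line
```

The same offset-tracking idea underlies SQL polling (remember the last seen key or timestamp instead of a byte offset).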
21. gschmutz
Streaming ETL
• Streaming Extract – Transform – Load
• flow-based "programming"
• high-throughput, straight-through data flows
• visual coding with flow editor
• Stream Data Integration, but not Stream Analytics
22. gschmutz
Event-at-a-time vs. Micro-Batch
Event-at-a-time Processing
• events processed as they arrive
• low(est) latency
• fault tolerance expensive
Micro-Batch Processing
• splits the incoming stream into small batches
• fault tolerance easier to achieve
• higher latency
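The difference can be sketched in a few lines of Python (illustrative only; real micro-batch engines such as Spark Streaming typically form batches by time interval rather than by count):

```python
def micro_batches(events, batch_size):
    """Group an ordered event stream into small batches, as a
    micro-batch engine does; each batch is then processed as one unit."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch

stream = ["e1", "e2", "e3", "e4", "e5"]

# Event-at-a-time: each event is handled the moment it arrives.
one_at_a_time = [e.upper() for e in stream]

# Micro-batch: events are handled in groups, trading latency for
# cheaper fault tolerance (a failed batch can simply be replayed).
batched = [[e.upper() for e in b] for b in micro_batches(stream, 2)]
```

Replaying a whole batch on failure is what makes fault tolerance cheaper here, at the cost of the batching delay.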
23. gschmutz
Delivery Guarantees
At most once (fire-and-forget) [ 0 | 1 ]
• the message is sent, but the sender does not care whether it is received or lost
At least once [ 1+ ]
• retransmission can cause a message to be delivered one or more times
Exactly once [ 1 ]
• ensures that a message is received once and only once (never lost and never repeated)
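A toy Python model of these semantics (names and scenario invented for illustration): under at-least-once delivery a lost acknowledgement leads to a retry and therefore a duplicate, and a consumer that remembers processed message ids turns this into effectively-exactly-once processing:

```python
def deliver_at_least_once(messages, ack_lost=False):
    """Toy model of at-least-once delivery: if the acknowledgement for
    a send is lost, the producer retries and the consumer receives the
    same message twice."""
    received = []
    for msg in messages:
        received.append(msg)       # original send arrives
        if ack_lost:
            received.append(msg)   # retried send arrives as a duplicate
    return received

def dedupe(received):
    """Consumer-side deduplication: discard message ids that were
    already processed, giving effectively-exactly-once results."""
    seen, result = set(), []
    for msg_id, payload in received:
        if msg_id not in seen:
            seen.add(msg_id)
            result.append(payload)
    return result

raw = deliver_at_least_once([(1, "a"), (2, "b")], ack_lost=True)
exactly_once_view = dedupe(raw)
```

This is the same idea behind idempotent producers and transactional reads in real systems: duplicates are unavoidable on the wire, so they are filtered out by identity.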
24. gschmutz
API
GUI-based / Drag-and-Drop
• a graphical way of designing a pipeline
• often web-based
Declarative
• the streaming engine is configured declaratively
• e.g. JSON, YAML

"config": {
  "connector.class": "..MqttSourceConnector",
  "tasks.max": "1",
  "mqtt.server.uri": "tcp://mosquitto-1:1883",
  "mqtt.topics": "truck/+/position",
  "kafka.topic": "truck_position",
  ...
25. gschmutz
API (II)
Programmatic
• low-level (class-based) or high-level fluent API
• higher-order functions as operators (filter, mapWithState, …)

val filteredDf = truckPosDf.
  where("eventType != 'Normal'")

Streaming SQL
• use a stream in the FROM clause
• extensions support windowing, pattern matching, spatial, …

SELECT *
FROM truck_position_s
WHERE eventType != 'Normal'
26. gschmutz
Event Time vs. Ingestion / Processing Time
Event time
• the time at which the event actually occurred
Ingestion time / processing time
• the time at which the event is ingested into / processed by the system
Not all use cases care about event time, but lots do!
27. gschmutz
Windowing
• computations over events are done using windows of data
• it is not feasible to keep the entire stream of data in memory
• a window represents a certain amount of data to perform computations on
[Diagram: a stream of data over time, with a window of data highlighted.]
28. gschmutz
Windowing
Fixed / Tumbling Window
• eviction based on the window being full; trigger based on either a count of items or time
Sliding / Hopping Window
• eviction and trigger based on window length and sliding interval
Session Window
• sequences of temporally related events, terminated by a gap of inactivity greater than some timeout
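The window assignment logic can be sketched in Python (a simplified model; real engine APIs differ): a timestamp maps to exactly one tumbling window, but possibly to several overlapping hopping windows:

```python
def tumbling_window(ts, size):
    """Assign a timestamp to its fixed/tumbling window [start, start+size)."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """A sliding/hopping window of length `size` advancing by `hop`:
    one event can belong to several overlapping windows."""
    wins = []
    # smallest window start (a multiple of hop) whose window still covers ts
    start = max(((ts - size) // hop + 1) * hop, 0)
    while start <= ts:
        wins.append((start, start + size))
        start += hop
    return wins

# An event at t=7 falls into one tumbling window of size 10,
# but into two hopping windows of size 10 with hop 5.
tumble = tumbling_window(7, 10)
hops = hopping_windows(7, 10, 5)
```

Session windows cannot be computed from a single timestamp this way; they depend on the gap to the neighbouring events, which is why they need state (see State Management below).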
29. gschmutz
Joining – Stream-to-Static
Challenges of joining streams
• data streams need to be aligned because of their different timestamps
• joins must be limited; otherwise they will never end
• the join needs to produce results continuously
• there is no end to the data
[Diagram: Stream-to-Static (Table) Join — a stream is joined against a static table over time.]
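A stream-to-static (lookup/enrichment) join is essentially a key lookup per event, sketched here in Python with hypothetical driver reference data; no windowing or buffering is needed because the static side does not move:

```python
# Static (table) side: driver reference data, keyed by driver id.
drivers = {10: "Anna", 11: "Ben"}

def enrich(position_events, table):
    """Stream-to-static join: each incoming event is enriched by a key
    lookup into the static table, producing results continuously."""
    for event in position_events:
        yield {**event, "driver_name": table.get(event["driver_id"], "unknown")}

stream = [{"driver_id": 10, "pos": "47.4,8.5"},
          {"driver_id": 12, "pos": "46.9,7.4"}]
enriched = list(enrich(stream, drivers))
```

Stream-to-stream joins are harder precisely because both sides move: each side must be windowed so the join has a bounded amount of data to match against.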
30. gschmutz
Joining – Stream-to-Stream
[Diagram: Stream-to-Stream Join with one window (only one stream is windowed) and with two windows (both streams are windowed).]
31. gschmutz
State Management
• needed if the use case depends on previously seen data
• Windowing, Joining and Pattern Detection use State Management behind the scenes
• state needs to be as close to the stream processor as possible
• how does it handle failures?
Options for State Management, in order of increasing operational complexity and features:
• In-Memory
• Local, Embedded Store
• Replicated, Distributed Store
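A minimal Python sketch of keyed state with a changelog, the mechanism (used e.g. by Kafka Streams with its local stores) that makes state recoverable after a failure; the class and field names here are invented for illustration:

```python
class CountState:
    """Keyed state store with a changelog: every update is appended to
    a log from which the state can be rebuilt after a crash."""
    def __init__(self):
        self.counts = {}
        self.changelog = []

    def update(self, key):
        """Increment the running count for key and log the new value."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.changelog.append((key, self.counts[key]))
        return self.counts[key]

    @staticmethod
    def restore(changelog):
        """Rebuild the state by replaying the changelog (failure recovery)."""
        state = CountState()
        for key, value in changelog:
            state.counts[key] = value
        return state

store = CountState()
for event_key in ["A", "B", "A"]:
    store.update(event_key)

# Simulated crash: a fresh instance recovers the counts from the log.
recovered = CountState.restore(store.changelog)
```

Keeping the store local to the processor gives fast access, while the changelog (in a real system, a replicated topic or distributed store) provides the durability.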
32. gschmutz
Queryable State (aka Interactive Queries)
• exposes the state managed by a Stream Analytics solution
• allows an application to query the managed state, e.g. to visualize it
• can eliminate the need for an external database to keep results
[Diagram: the stream processor's state is exposed via a Query API to dashboards, search/explore and online & mobile apps, alongside reference data and models.]
33. gschmutz
Event Pattern Detection
• streaming data often contains interesting patterns
• special operators allow finding complex relationships between events:
• Absence Pattern – event A not followed by event B within a time window
• Sequence Pattern – event A followed by event B followed by event C
• Increasing Pattern – upward trend in the value of a certain attribute
• Decreasing Pattern – downward trend in the value of a certain attribute
• …
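A sequence pattern ("event A followed by event B within a time window") can be sketched as follows in Python; real engines expose this through declarative CEP operators rather than hand-written loops, but the underlying bookkeeping looks similar:

```python
def detect_sequence(events, first, then, max_gap):
    """Detect the pattern 'first followed by then within max_gap time
    units'. Events are (timestamp, type) pairs in timestamp order;
    returns the matching (ts_first, ts_then) pairs."""
    pending = []   # timestamps of so-far-unmatched 'first' events
    matches = []
    for ts, kind in events:
        # forget 'first' events whose matching window has expired
        pending = [t for t in pending if ts - t <= max_gap]
        if kind == first:
            pending.append(ts)
        elif kind == then and pending:
            matches.append((pending.pop(0), ts))  # match earliest 'first'
    return matches

events = [(1, "A"), (2, "C"), (3, "B"), (10, "A"), (20, "B")]
hits = detect_sequence(events, "A", "B", max_gap=5)
# the A at t=10 finds no B within 5 time units, so only (1, 3) matches
```

Note that the expiry check is also how an absence pattern would be detected: a pending "first" that expires without a match is exactly "A not followed by B within the window".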
34. gschmutz
Capabilities: Stream Data Integration vs. Stream Analytics (recap)

Capability | Stream Data Integration | Stream Analytics
Support for Various Data Sources | yes | -
Streaming ETL (Transformation / Format Translation, Routing, Validation) | yes | partial
Execution Mode: Native Streaming | yes | yes
Execution Mode: Non-Native Streaming (Micro-Batching) | yes | partial
Delivery Guarantees | yes | yes
API: GUI-Based / Declarative / Programmatic | yes | yes
API: Streaming SQL | - | yes
Event Time vs. Ingestion / Processing Time | - | yes
Windowing | - | yes
Stream-to-Static Joins (Lookup / Enrichment) | partial | yes
Stream-to-Stream Joins | - | yes
State Management | - | yes
Queryable State (aka Interactive Queries) | - | yes
Event Pattern Detection | - | yes
36. gschmutz
Stream Processing & Analytics Ecosystem
[Diagram: product landscape, open source vs. closed source, grouped into Stream Analytics, Stream Data Integration, Event Hub and Edge. Source: adapted from Tibco]
37. gschmutz
Event Hub: Apache Kafka
[Diagram: Kafka cluster with Brokers 1–3, coordinated by a Zookeeper ensemble (ZK 1–3); producers (incl. kafkacat) write and consumers read; surrounded by the Schema Registry and management tooling (Control Center, Kafka Manager, KAdmin).]
Data Retention
• never
• time- (TTL) or size-based
• log-compacted
38. gschmutz
Stream Data Integration: Kafka Connect
• declarative-style data flows
• the framework is part of Kafka
• many connectors available
• Single Message Transforms (SMT)

curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" -d $'{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "tcp://mosquitto:1883",
    "mqtt.topics": "truck/+/position",
    "kafka.topic": "truck_position" }
}'
39. gschmutz
Stream Data Integration: StreamSets
• GUI-based, drag-and-drop Data Flow Pipelines
• both stream and batch processing
• special option for Edge computing
• custom sources, sinks, processors
• monitoring and error detection
40. gschmutz
Stream Analytics: Kafka Streams
• Programmatic API, "just" a Java library
• native streaming
• fault-tolerant local state
• fixed, sliding and session windowing
• stream-stream / stream-table joins
• at-least-once and exactly-once delivery

KTable<Integer, Customer> customers = builder.table("customer");
KStream<Integer, Order> orders = builder.stream("order");
KStream<Integer, String> joined = orders.leftJoin(customers, …);
joined.to("orderEnriched");

[Diagram: a Java application with the embedded Kafka Streams library, reading from and writing to the Kafka broker.]
41. gschmutz
Stream Analytics: KSQL
• stream processing with zero coding, using a SQL-like language
• part of the Confluent Platform (community edition)
• built on top of Kafka Streams
• interactive (CLI) and headless (command file) modes

CREATE STREAM customer_s WITH (kafka_topic='customer', value_format='AVRO');
SELECT * FROM customer_s WHERE address->country = 'Switzerland';
...

[Diagram: the KSQL CLI sends commands to the KSQL engine, which runs on Kafka Streams against the Kafka broker.]
43. gschmutz
Summary
• Stream Processing is the solution for low-latency use cases
• Event Hub, Stream Data Integration and Stream Analytics are the main building blocks in your architecture
• Kafka is currently the de-facto standard for the Event Hub
• Various options exist for Stream Data Integration and Stream Analytics
• SQL is becoming a valid option for implementing Stream Analytics