Introduction to Stream Processing

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Introduction to
Stream Processing
Guido Schmutz
Java Zone 2018 – 13.9.2018
@gschmutz guidoschmutz.wordpress.com

Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Introduction to Stream Processing

Agenda
1. Motivation for Stream Processing?
2. Capabilities for Stream Processing
3. Implementing Stream Processing Solutions
4. Summary

Motivation for Stream Processing?

Why not using Traditional BI Infrastructures?
Enterprise Data
Warehouse
ETL / Stored
Procedures
Bulk Source
DB
Extract
File
DB
Architekturen von Big Data Anwendungen
BI Tools
Search / Explore
Enterprise Apps
Logic
{ }
API
high latency

Bulk Source
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
BI Tools
Enterprise Data
Warehouse
SQL
Search / Explore
Parallel
Processing
Storage
Storage
RawRefined
Results
high latency
Enterprise Apps
Logic
{ }
API
File Import / SQL Import
DB
Extract
File
DB
Big Data solves Volume and Variety – not Velocity

Bulk Source
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
BI Tools
Enterprise Data
Warehouse
SQL
Search / Explore
Parallel
Processing
Storage
Storage
RawRefined
Results
high latency
Enterprise Apps
Logic
{ }
API
DB
Extract
File
DB
Event Source
Location
Telemetry
IoT
Data
Mobile
Apps
Social
Event Stream

Bulk Source
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
BI Tools
Enterprise Data
Warehouse
SQL
Search / Explore
• Machine Learning
• Graph Algorithms
• Natural Language Processing
Parallel
Processing
Storage
Storage
RawRefined
Results
high latency
Enterprise Apps
Logic
{ }
API
DB
Extract
File
DB
Event Stream
Event Source
Location
IoT
Data
Mobile
Apps
Social
Event
Hub
Event
Hub
Event
Hub
Telemetry

"Data at Rest" vs. "Data in Motion"
Data at Rest Data in Motion
Store
Act
Analyze
StoreAct
Analyze
1110
1010
1010
110
1110
1010
1010
110

Typical Stream Processing Use Cases
• Notifications and Alerting - a notification or alert should be triggered if
some sort of event or series of events occurs.
• Real-Time Reporting – run real-time dashboards that
employees/customers can look at
• Incremental ETL – still ETL, but not in Batch but in streaming, continuous
mode
• Update data to serve in real-time – compute data that get served
interactively by other applications
• Real-Time decision making – analyzing new inputs and responding to
them automatically using business logic, i.e. Fraud Detection
• Online Machine Learning – train a model on a combination of historical
and streaming data and use it for real-time decision making

When to Stream / When not?
Constant low
Milliseconds & under
Low milliseconds to seconds,
delay in case of failures
10s of seconds of more,
Re-run in case of failures
Real-Time Near-Real-Time Batch
Source: adapted from Cloudera

"No free lunch"
Constant low
Milliseconds & under
Low milliseconds to seconds,
delay in case of failures
10s of seconds of more,
Re-run in case of failures
Real-Time Near-Real-Time Batch
"Difficult" architectures, lower latency "Easier architectures", higher latency

Event
Hub
Event
Hub
Hadoop Clusterd
Hadoop Cluster
Stream Analytics
Platform
Stream Processing Architecture solves Velocity
BI Tools
Enterprise Data
Warehouse
Event
Hub
Search / Explore
Enterprise Apps
Search
Results
Stream Analytics
Reference /
Models
Dashboard
Logic
{ }
API
Event
Stream
Event
Stream
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Low(est) latency, no history
Telemetry

Hadoop Clusterd
Hadoop Cluster
Stream Analytics
Platform
Big Data for all historical data analysis
BI Tools
Enterprise Data
Warehouse
Search / Explore
Enterprise Apps
Search
Results
Stream Analytics
Reference /
Models
Dashboard
Logic
{ }
API
Event
Stream
Event
Stream
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Data FlowEvent
Hub
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Telemetry

Data Store
Integrate existing systems through CDC
Data
Event Hub
Integration
Consuming Systems
StateLogic
CDC
CDC Connector
Traditional Silo-based
System
LogicUser Interface
Capture changes directly on database
Change Data Capture (CDC) => think like
a global database trigger
Transform existing systems to event
producer
Event
Stream
Event
Stream

Hadoop Clusterd
Hadoop Cluster
Stream Analytics
Platform
Integrate existing systems with lower latency through CDC
BI Tools
Enterprise Data
Warehouse
Search / Explore
Enterprise Apps
Search
Results
Stream Analytics
Reference /
Models
Dashboard
Logic
{ }
API
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Event
Stream
Event
Stream
Data FlowEvent
Hub
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Telemetry

New systems participate in event-oriented fashion
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice Platform
Microservice State
{ }
API
Stream Analytics Platform
Stream
Processor
State
{ }
API
Event
Stream
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Service
Enterprise Apps
Logic
{ }
API
Event
Stream
Data FlowEvent
Hub
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Event
Stream
Event
Stream
Telemetry

Edge computing allows processing close to data sources
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice Platfrom
Microservice State
{ }
API
Stream Analytics Platform
Stream
Processor
State
{ }
API
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Service
Enterprise Apps
Logic
{ }
API
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Edge Node
Rules
Event Hub
Storage
Event
Hub
Event
Stream
Event
Stream
Event Stream
Telemetry

Hadoop Clusterd
Hadoop Cluster
Big Data
Unified Architecture for Modern Data Analytics Solutions
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Event
Hub
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice State
{ }
API
Stream
Processor
State
{ }
API
Event
Stream
Event
Stream
Service
Stream Analytics
Microservices
Enterprise Apps
Logic
{ }
API
Edge Node
Rules
Event Hub
Storage
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Event Stream
Telemetry

Two Types of Stream Processing
(from Gartner)
Stream Data Integration
• primarily focuses on the ingestion and
processing of data sources targeting real-
time extract-transform-load (ETL) and data
integration use cases
• filter and enrich the data
• optionally calculate time-windowed
aggregations before storing the results in a
database or file system
Stream Analytics
• targets analytics use cases
• calculating aggregates and detecting
patterns to generate higher-level, more
relevant summary information (complex
events)
• Complex events may signify threats or
opportunities that require a response from
the business through real-time dashboards,
alerts or decision automation

Important Capabilities for Stream
Processing

Streaming Model: Event-at-a-time vs. Micro Batch
Event-at-a-time Processing
• Events processed as they
arrive
• low-latency
• fault tolerance expensive
Micro-Batch Processing
• Splits incoming stream in
small batches
• Fault tolerance easier
• Better throughput

Fault Tolerance: Delivery Guarantees
• At most once (fire-and-forget) - means the message is sent, but the sender
doesn’t care if it’s received or lost. Imposes no additional overhead to ensure
message delivery, such as requiring acknowledgments from consumers. Hence, it
is the easiest and most performant behavior to support.
• At least once - means that retransmission of a message will occur until an
acknowledgment is received. Since a delayed acknowledgment from the receiver
could be in flight when the sender retransmits the message, the message may be
received one or more times.
• Exactly once - ensures that a message is received once and only once, and is
never lost and never repeated. The system must implement whatever
mechanisms are required to ensure that a message is received and processed
just once
[ 0 | 1 ]
[ 1+ ]
[ 1 ]

API
Compositional / Programmatic
• Highly customizable operator based
on basic building blocks
• Manual topology definition and
optimization
Declarative
• High-Level, fluent API
• Higher order function as operators
(filter, mapWithState …)
• Logical plan optimization
SQL
• Query language allowing to use
stream in FROM clause
• Extensions supporting windowing,
pattern matching, spatial, ….
Operators
• SELECT * FROM <stream> WHERE
f1 = 10

Windowing
Computations over events done using
windows of data
Due to size and never-ending nature of it,
it’s not feasible to keep entire stream of
data in memory
A window of data represents a certain
amount of data where we can perform
computations on
Windows give the power to keep a
working memory and look back at recent
data efficiently
Time
Stream of Data Window of Data

Sliding Window (aka
Hopping Window) - uses
eviction and trigger policies
that are based on time: window
length and sliding interval
length
Fixed Window (aka Tumbling
Window) - eviction policy always
based on the window being full
and trigger policy based on
either the count of items in the
window or time
Session Window – composed
of sequences of temporarily
related events terminated by a
gap of inactivity greater than
some timeout
Windowing
Time TimeTime

Joining
Challenges of joining streams
1. Data streams need to be aligned as they
come because they have different timestamps
2. since streams are never-ending, the joins
must be limited; otherwise join will never end
3. join needs to produce results continuously as
there is no end to the data
Stream to Static (Table) Join
Stream to Stream Join (one window join)
Stream to Stream Join (two window join)
Stream-
to-Static
Join
Stream-
to-Stream
Join
Stream-
to-Stream
Join
Time
Time
Time

Pattern Detection
• Streaming Data often contains interesting patterns that only emerge as new
streaming data arrives, e.g.
• Absence Pattern: event A not followed by event B within time window
• Sequence Pattern: event A followed by event B followed by event C
• Increasing Pattern: up trend of a value of a certain attribute
• Decreasing Pattern: down trend of a value of a certain attribute
• …
• Pattern operators allow developers to define complex relationships between
streaming events

State Management
Necessary if stream processing use case
is dependent on previously seen data or
external data
Windowing, Joining and Pattern
Detection use State Management behind
the scenes
State Management services can be made
available for custom state handling logic
State needs to be managed as close to
the stream processor as possible
Options for State Management
How does it handle failures? If a machine
crashes and the/some state is lost?
In-Memory
Replicated,
Distributed
Store
Local,
Embedded
Store
Operational Complexity and Features
Low high

Queryable State (aka. Interactive Queries)
Exposes the state managed by the
Stream Analytics solution to the
outside world
Allows an application to query the
managed state, i.e. to visualize it
For some scenarios, Queryable State
can eliminate the need for an external
database to keep results
Stream Processing Cluster
Reference
Data
Stream Analytics
{ }
Query API
State
Stream
Processor
Search /
Explore
Online &
Mobile
Apps
Model
Dashboard

Event Time vs. Ingestion Time vs. Processing Time
Event time
• the time at which events actually occurred
Ingestion time / Processing Time
• the time at which events are ingested /
observed in the system
Not all use cases care about event times but many do
Examples
• characterizing user behavior over time
• most billing applications
• anomaly detection

Capabilities: Stream Data Integration vs. Stream Analytics
Stream Data Integration Stream Analytics
Support for Various Data Sources Yes (yes)
Streaming ETL (Transformation, Routing, Validation) yes (yes)
Format Translation yes (yes)
Micro-Batching yes (yes)
Auto Scaling yes
Backpressure Support yes -
Processing Time vs. Event Time - yes
Stream-to-Static Joins (Lookup/Enrichment) yes yes
Stream-to-Stream Joins - yes
Windowing - yes
State Management - yes
Queryable State (aka Interactive Queries) - yes
Streaming SQL (yes) yes
Event Pattern Matching / Detection - Yes

Implementing Stream Processing
Solutions

Stream Processing & Analytics Ecosystem
Stream Analytics
Event Hub
Open Source Closed Source
Stream Data Integration
Source: adapted from Tibco
Edge

Solutions:
"Event Hub"

Implementing "Event Hub"
Hadoop Clusterd
Hadoop Cluster
Cluster Infrastructure
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice State
{ }
API
Stream
Processor
State
{ }
API
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Service
Enterprise Apps
Logic
{ }
API
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
Edge Node
Rules
Event Hub
Storage
Event
Hub
Event
Stream
Event
Stream
Event Stream
Replay
Big Data Batch Analytics
Stream Analytics
Modern Applications

Apache Kafka – A Streaming Platform
High-Level Architecture
Distributed Log at the Core
Scale-Out Architecture
Logs do not (necessarily) forget

Hold Data for Long-Term – Data Retention
Producer 1
Broker 1
Broker 2
Broker 3
1. Never
2. Time based (TTL)
log.retention.{ms | minutes | hours}
3. Size based
log.retention.bytes
4. Log compaction based
(entries with same key are removed):
kafka-topics.sh --zookeeper zk:2181
--create --topic customers
--replication-factor 1
--partitions 1
--config cleanup.policy=compact

Keep Topics in Compacted Form
0 1 2 3 4 5 6 7 8 9 10 11
K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 K2
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
Offset
Key
Value
3 4 6 8 9 10
K1 K3 K4 K5 K2 K6
V4 V5 V7 V9 V10 V11
Offset
Key
Value
Compaction
V1
V2
V3 V4
V5
V6
V7
V8 V9
V1
0
V1
1
K1
K3
K4
K5
K2
K6

Demo (I) - Kafka
Truck-2
truck
position
Truck-1
Truck-3
console
consumer
Testdata-Generator by Hortonworks
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Demo (I) – Create Kafka Topic
$ kafka-topics --zookeeper zookeeper:2181 --create
--topic truck_position --partitions 8 --replication-factor 1
$ kafka-topics --zookeeper zookeeper:2181 –list
__consumer_offsets
_confluent-metrics
_schemas
docker-connect-configs
docker-connect-offsets
docker-connect-status
truck_position

Demo (I) – Run Producer and Kafka-Console-Consumer

Solutions:
Streaming Data Sources

Integrating Streaming Data Sources
Hadoop Clusterd
Hadoop Cluster
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice State
{ }
API
Stream
Processor
State
{ }
API
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Service
Enterprise Apps
Logic
{ }
API
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
Edge Node
Rules
Event Hub
Storage
Event
Hub
Event
Stream
Event
Stream
Event Stream
Replay
Stream Analytics
Modern Applications

Integrating (Streaming) Data Sources
SQL Polling
Change Data Capture
(CDC)
File Polling
File Stream (File Tailing)
File Stream (Appender)
Sensor Stream

Demo (II) – devices send to MQTT instead of Kafka
Truck-2
truck/nn/
position
Truck-1
Truck-3
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837
Message Queue Telemetry Transport
(MQTT)
Reduces amount of bytes flowing over the
wire compared to HTTP -> reduces power
and bandwidth usage
Built in retry / QoS mechansim
Pub/Sub architecture and Message Broker
allows for flexibility in communication
patterns
Simple API, requires minimal "plumbing"
code

Demo (II) – devices send to MQTT instead of Kafka

Demo (II) - devices send to MQTT instead of Kafka –
how to get the data into Kafka?
Truck-2
truck/nn/
position
Truck-1
Truck-3
truck
position raw
?
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Solutions:
"Stream Data Integration"

Implementing "Stream Data Integration"
Hadoop Clusterd
Hadoop Cluster
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice State
{ }
API
Stream
Processor
State
{ }
API
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Service
Enterprise Apps
Logic
{ }
API
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
Edge Node
Rules
Event Hub
Storage
Event
Hub
Event
Stream
Event
Stream
Event Stream
Replay
Stream Analytics
Modern Applications

Integration with or without
Transformation?
Zero Transformation
• No transformation, plain ingest, no
schema validation
• Keep the original format – Text,
CSV, …
• Allows to store data that may have
errors in the schema
Format Transformation
• Prefer name of Format Translation
• Simply change the format
• Change format from Text to Avro
• Does schema validation
Enrichment Transformation
• Add new data to the message
• Do not change existing values
• Convert a value from one system to
another and add it to the message
Value Transformation
• Replaces values in the message
• Convert a value from one system to
another and change the value in-place
• Destroys the raw data!
d

Apache NiFi
• Originated at NSA as Niagarafiles – developed
behind closed doors for 8 years
• Open sourced December 2014, Apache Top
Level Project July 2015
• Look-and-Feel modernized in 2016
• Opaque, "file-oriented" payload
• Distributed system of processors with
centralized control
• Based on flow-based programming concepts
• Data Provenance and Data Lineage
• Web-based user interface

StreamSets Data Collector
Founded by ex-Cloudera, Informatica
employees
Continuous open source, intent-driven, big
data ingest
Visible, record-oriented approach fixes
combinatorial explosion
Batch or stream processing
• Standalone, Spark cluster, MapReduce
cluster
IDE for pipeline development by ‘civilians’
Relatively new - first public release
September 2015
So far, vast majority of commits are from
StreamSets staff

Demo (III) – StreamSets Data Collector
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt to
kafka
truck_
position
console
consumer
2016-06-02
14:39:56.605,98,27,803014426,
Wichita to Little Rock Route2,
Normal,38.65,90.21,51872977366525026
31
Truck-2
truck/nn/
position
Truck-1
Edge
mqtt to
kafka

Demo (III): Dataflow for MQTT to Kafka

Demo (III): MQTT Source

Demo (III): Kafka Sink

Kafka Connect - Overview
Source
Connecto
r
Sink
Connecto
r

Demo (IV)
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt to
kafka
truck_
position
console
consumer
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Demo (IV) – Create MQTT Connect through REST API
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors"
-H "Content-Type: application/json"
-d $'{
"name": "mqtt-source",
"config": {
"connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
"connect.mqtt.connection.timeout": "1000",
"tasks.max": "1",
"name": "mqtt-source",
"mqtt.server.uri": "tcp://mosquitto:1883»,
"mqtt.topics": "truck/+/position",
"kafka.topic":"truck_position",
"mqtt.clean.session.enabled":"true",
"mqtt.keepalive.interval.seconds ": "60",
"mqtt.qos":"0"
}
}'

Demo (IV)
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt to
kafka
truck_
position
console
consumer
what about some
analytics ?
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Solutions:
"Stream Analytics"

Implementing "Stream Analytics"
Hadoop Clusterd
Hadoop Cluster
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice State
{ }
API
Stream
Processor
State
{ }
API
SQL
Search
BI Tools
Enterprise Data
Warehouse
Search / Explore
Service
Enterprise Apps
Logic
{ }
API
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
Edge Node
Rules
Event Hub
Storage
Event
Hub
Event
Stream
Event
Stream
Event Stream
Replay
Stream Analytics
Modern Applications

Kafka Streams - Overview
• Designed as a simple and lightweight library in
Apache Kafka
• no external dependencies other than Kafka
• Leverages Kafka as its internal messaging layer
• Supports fault-tolerant local state
• Event-at-a-time processing (not micro-batch) with
millisecond latency
• Supports Windowing (Fixed, Sliding and Session)
and Stream-Stream / Stream-Table Joins
• Interactive Query
• Support for Event Time with late arrivals
• Millisecond processing latency, no micro-batching
• At-least-once and exactly-once processing
guarantees
KStream<Integer, String> stream1 =
builder.stream("in-1");
KStream<Integer, String> stream2=
builder.stream("in-2");
KStream<Integer, String> joined =
stream1.leftJoin(stream2, …);
KTable<> aggregated =
joined.groupBy(…).count("store");
aggregated.to("out-1");

Demo (V) – Kafka Streams
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt to
kafka
truck_
position_s
detect_dang
erous_drivin
g
dangerous_
driving
console
consumer
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Demo (V) - Create Stream
final KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source =
builder.stream(stringSerde, stringSerde, "truck_position");
KStream<String, TruckPosition> positions =
source.map((key,value) ->
new KeyValue<>(key, TruckPosition.create(key,value)));
KStream<String, TruckPosition> filtered =
positions.filter(TruckPosition::filterNonNORMAL);
filtered.map((key,value) -> new KeyValue<>(key,value.toCSV()))
.to("dangerous_driving");

KSQL: Streaming SQL for Apache Kafka
• Enables stream processing with zero coding required
• The simples way to process streams of data in real-time
• Powered by Kafka and Kafka Streams: scalable, distributed,
mature
• All you need is Kafka – no complex deployments
• STREAM and TABLE as first-class citizens
• STREAM = data in motion
• TABLE = collected state of a stream
• join STREAM and TABLE
Source: Confluent

Demo (VI) - KSQL
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt to
kafka
truck_
position_s
detect_dang
erous_drivin
g
dangerous_
driving
console
consumer
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Demo (VI) - Start Kafka KSQL
$ docker-compose exec ksql-cli ksql-cli local --bootstrap-server broker-1:9092
======================================
= _ __ _____ ____ _ =
= | |/ // ____|/ __ | | =
= | ' /| (___ | | | | | =
= | < ___ | | | | | =
= | . ____) | |__| | |____ =
= |_|______/ __________| =
= =
= Streaming SQL Engine for Kafka =
Copyright 2017 Confluent Inc.
CLI v0.1, Server v0.1 located at http://localhost:9098
Having trouble? Type 'help' (case-insensitive) for a rundown of how things work!
ksql>

Demo (VI) - Create Stream
ksql> CREATE STREAM truck_position_s
(ts VARCHAR,
truckId VARCHAR,
driverId BIGINT,
routeId BIGINT,
eventType VARCHAR,
latitude DOUBLE,
longitude DOUBLE,
correlationId VARCHAR)
WITH (kafka_topic='truck_position',
value_format='DELIMITED');
Message
----------------
Stream created

ksql> SELECT * FROM truck_position_s;
1522847870317 | "truck/13/position0 | �1522847870310 | 44 | 13 | 1390372503 |
Normal | 41.71 | -91.32 | -2458274393837068406
1522847870376 | "truck/14/position0 | �1522847870370 | 35 | 14 | 1961634315 |
Normal | 37.66 | -94.3 | -2458274393837068406
1522847870418 | "truck/21/position0 | �1522847870410 | 58 | 21 | 137128276 |
Normal | 36.17 | -95.99 | -2458274393837068406
1522847870397 | "truck/29/position0 | �1522847870390 | 18 | 29 | 1090292248 |
Normal | 41.67 | -91.24 | -2458274393837068406
ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
1522847914246 | "truck/11/position0 | �1522847914240 | 54 | 11 | 1198242881 |
Lane Departure | 40.86 | -89.91 | -2458274393837068406
1522847915125 | "truck/10/position0 | �1522847915120 | 93 | 10 | 1384345811 |
Overspeed | 40.38 | -89.17 | -2458274393837068406
1522847919216 | "truck/12/position0 | �1522847919210 | 75 | 12 | 24929475 |
Overspeed | 42.23 | -91.78 | -2458274393837068406

ksql> CREATE STREAM dangerous_driving_s
WITH (kafka_topic= dangerous_driving_s',
value_format='JSON')
AS SELECT * FROM truck_position_s
WHERE eventtype != 'Normal';
Message
----------------------------
Stream created and running
ksql> select * from dangerous_driving_s;
1522848286143 | "truck/15/position0 | �1522848286125 | 98 | 15 | 987179512 |
Overspeed | 34.78 | -92.31 | -2458274393837068406
1522848295729 | "truck/11/position0 | �1522848295720 | 54 | 11 | 1198242881 |
Unsafe following distance | 38.43 | -90.35 | -2458274393837068406
1522848313018 | "truck/11/position0 | �1522848313000 | 54 | 11 | 1198242881 |
Overspeed | 41.87 | -87.67 | -2458274393837068406

Demo (VII) – KSQL Join
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt-
source
truck_
position
detect_dang
erous_drivin
g
dangerous_
driving
Truck
Driver
jdbc-
source
trucking_
driver
join_dangero
us_driving_dr
iver
dangerous_d
riving_driver
27, Walter, Ward, Y, 24-JUL-85, 2017-10-02
15:19:00
console
consumer
{"id":27,"firstName":"Walter"
,"lastName":"Ward","availab
le":"Y","birthdate":"24-JUL-
85","last_update":15069230
52012}
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

Demo (VII) – Create JDBC Connect through REST API
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors"
-H "Content-Type: application/json"
-d $'{
"name": "jdbc-driver-source",
"config": {
"connector.class": "JdbcSourceConnector",
"connection.url":"jdbc:postgresql://db/sample?user=sample&password=sample",
"mode": "timestamp",
"timestamp.column.name":"last_update",
"table.whitelist":"driver",
"validate.non.null":"false",
"topic.prefix":"trucking_",
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"name": "jdbc-driver-source",
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
}
}'

Demo (VII) - Create Table with Driver State
ksql> CREATE TABLE driver_t
(id BIGINT,
first_name VARCHAR,
last_name VARCHAR,
available VARCHAR)
WITH (kafka_topic='trucking_driver',
value_format='JSON',
key='id');
Message
----------------
Table created

Demo (VII) - Create Table with Driver State
ksql> CREATE STREAM dangerous_driving_and_driver_s
WITH (kafka_topic='dangerous_driving_and_driver_s',
value_format='JSON')
AS SELECT driverId, first_name, last_name, truckId, routeId, eventtype
FROM truck_position_s
LEFT JOIN driver_t
ON dangerous_driving_and_driver_s.driverId = driver_t.id;
Message
----------------------------
Stream created and running
ksql> select * from dangerous_driving_and_driver_s;
1511173352906 | 21 | 21 | Lila | Page | 58 | 1594289134 | Unsafe tail distance
1511173353669 | 12 | 12 | Laurence | Lindsey | 93 | 1384345811 | Lane Departure
1511173435385 | 11 | 11 | Micky | Isaacson | 22 | 1198242881 | Unsafe tail
distance

Spark Structured Streaming
• Stream processing on Structured API (2nd generation
with DataFrames / Datasets rather than RDDs)
• Code reuse between batch and streaming
• Supports Windowing (Fixed, Sliding) and Stream-to-
Stream / Stream-to-Static Joins
• Support for Event Time with late arrivals
• Currently only supports Micro-Batching
• At-least-once and partial Exactly-once support
• Support for Java, Scala, Python
val schema = new StructType()
.add(...)
val inputDf = spark
.readStream
.format(...)
.option(...)
.load()
val filteredDf = inputDf.where(...)
val query = filteredDf
.writeStream
.format(...)
.option(...)
.start()
query.stop

Demo (VIII) – Spark Structured Streaming
Truck-2
truck/nn/
position
Truck-1
Truck-3
mqtt to
kafka
truck_
position_s
detect_dang
erous_drivin
g
dangerous_
driving
console
consumer
1522846456703,101,31,1927624662,Normal,37.
31,-94.31,-4802309397906690837

val rawDf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("subscribe", "truck_position")
.load()
val jsonDf = rawDf.select($"value" cast "string" as "json")
.select(from_json($"json", truckPositionSchema) as "data")
.selectExpr("data.*", "cast(cast (data.timestamp as double) /
1000 as timestamp) as eventTime")
val jsonFilteredDf = jsonDf.where("data.eventType !='Normal'”)

val query = jsonFilteredDf
.selectExpr("to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("topic","dangerous_driving_spark")
.option("checkpointLocation", "/tmp")
.start()

Summary

Summary
Stream Processing is the solution for low-latency
Event Hub, Stream Data Integration and Stream Analytics are the main building
blocks in your architecture
Kafka is currently the de-facto standard for Event Hub
Various options exists for Stream Data Integration and Stream Analytics
Kafka offers a full-blown streaming platform with support for all 3 options (although not
as flexible in Stream Data Integration)
Not yet complete (SQL, Event Pattern Detection, Streaming Machine Learning)

Technology on its own won't help you.
You need to know how to use it properly.

Introduction to Stream Processing

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Introduction to Stream Processing

Similar to Introduction to Stream Processing (20)

More from Guido Schmutz

More from Guido Schmutz (20)

Recently uploaded

Recently uploaded (20)

Introduction to Stream Processing