Apache Kafka Streams + Machine Learning / Deep Learning

1Confidential
Apache Kafka + Machine Learning
Analytic Models Applied to Real Time Stream Processing
Kai Waehner
Technology Evangelist
kontakt@kai-waehner.de
LinkedIn
@KaiWaehner
www.kai-waehner.de

2Apache Kafka and Machine Learning
Agenda
1) Machine Learning in the Real World
2) Building an Analytic Model
3) Applying an Analytic Model in Real Time
4) Online Training of Models

Agenda

Machine Learning
... allows computers to find hidden insights without being
explicitly programmed where to look.

Real World Examples of Machine Learning
Spam Detection
Search Results +
Product Recommendation
Picture Detection
(Friends, Locations, Products)
Your Company
The Next Disruption:
Google Beats Go Champion

Leverage Machine Learning to Analyze and Act on Critical Business Moments
Seconds Minutes Hours
Price
Optimization
Predictive
Maintenance
Fraud
Detection
Cross
Selling
Transportation
Rerouting
Customer
Service
Inventory
Management
Windows of Opportunity

How to realize
these use cases?

Big Data Analytics
Volume
(terabytes,
petabytes)
Variety
(social networks,
blog posts, logs,
sensors, etc.)
Velocity
(„real time“)
Value

Big Data Analytics for Actionable Insights
From Insight to Action
(continuously closed loop)

Streaming Platform
Big Data Analytics
Database
IoT Device
Streaming
Producer
…..
DWH
Data
Integration
C
O
N
N
E
C
T
C
O
N
N
E
C
T
Data Lake
Model
Building
Batch
Real
Time
Stream
Processing
REST
Interface
IoT Device
Mobile App
Streaming
Consumer
C
O
N
N
E
C
T
C
O
N
N
E
C
T
BI Tool
Messaging
Web
Application
Model
Schema Registry
/ Governance
1) Data Producer
2) Analytics Platform
3) Streaming Platform
4) Data Consumer

Agenda

Streaming Platform
Big Data Analytics
Database
IoT Device
Streaming
Producer
…..
DWH
Data
Integration
C
O
N
N
E
C
T
C
O
N
N
E
C
T
Data Lake
Model
Building
Batch
Real
Time
Stream
Processing
REST
Interface
IoT Device
Mobile App
Streaming
Consumer
C
O
N
N
E
C
T
C
O
N
N
E
C
T
BI Tool
Messaging
Web
Application
Model
Schema Registry
/ Governance
1) Data Producer
4) Data Consumer

Hidden Technical Debt in Machine Learning Systems
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Writing
source code
is not the
time-consuming
task!
!

Analytical Pipeline
1. Data Access
2. Data Preparation
3. Exploratory Data Analysis
4. Model Building
5. Model Execution
6. Model Validation
7. Deployment

Data Access
Find insights to create
added business value
by correlating
various data sources!

Data Preparation
http://www.slideshare.net/odsc/feature-engineering
Data Preparation

Exploratory Data Analysis
© Copyright 2000-2017 TIBCO Software Inc.
• Scripting
• Visual Analytics
• Machine Learning

Model Building
A model is a simplification of the truth
that helps you with decision making.

Model Execution (Coding)
Apply Model
to New Data

Model Execution (Tooling)
Apply Model
to New Data

Model Validation
https://genome.tugraz.at/proclassify/help/pages/XV.html
Cross-Validation
Procedure

Frameworks
and Tooling?

Languages, Frameworks and Tools
Many more ….
Portable Format
for Analytics (PFA)

Live Demos with Open Source Technologies
Development of Analytic Models
with R, TensorFlow, Apache Spark, H2O.ai, RapidMiner

Live Demo
Use Case:
Customer Churn Prediction
Machine Learning Algorithm:
Generalized Linear Model (GLM)
using Logistic Regression
Technology:
Open Source R

Live Demo
Use Case:
Airline Flight Delay Prediction
Gradient Boosted Machines (GBM)
using Decision Trees
Technology:
H2O.ai

Live Demo
Use Case:
Predictive Maintenance
(Anomaly Detection in Telco Networks)
Deep Learning Algorithm:
Artificial Neural Networks (ANN)
using Autoencoders
Technology:
TensorFlow + Python API

Live Demo
Use Case:
Classification
(Prediction of Titanic Survivors)
Deep Learning Algorithm:
Recurrent Neural Networks (RNN)
Technology:
RapidMiner

Agenda

Analytical Pipeline
1. Data Access
2. Data Preparation
4. Model Building
5. Model Execution
6. Model Validation
7. Deployment

Streaming Platform
Big Data Analytics
Database
IoT Device
Streaming
Producer
…..
DWH
Data
Integration
C
O
N
N
E
C
T
C
O
N
N
E
C
T
Data Lake
Model
Building
Batch
Real
Time
Stream
Processing
REST
Interface
IoT Device
Mobile App
Streaming
Consumer
C
O
N
N
E
C
T
C
O
N
N
E
C
T
BI Tool
Messaging
Web
Application
Model
Schema Registry
/ Governance
1) Data Producer
4) Data Consumer

Definition of Stream Processsing
Data at Rest Data in Motion

Key Concepts

Stream Processing
Use Cases
• Real Time Applications
• Stateful Streaming Analytics
• Stateless “Real Time ETL”

Event Processing Windows
Various Options for Windowing (Fixed, Sliding, Session, …)

How to
apply analytic models
to real time processing
without redevelopment?

Application of Analytic Models to Real Time without Redevelopment
Stream
Processing
H20.ai
R
Python
Spark ML
MATLAB
SAS
PMML

Streaming Analytics - Processing Pipeline
APIs
Adapters /
Channels
Integration
Messaging
Stream
Ingest
Transformation
Aggregation
Enrichment
Filtering
Stream
Preprocessing
Process
Management
Analytics
(Real Time)
Applications
& APIs
Analytics /
DW Reporting
Stream
Outcomes
• Contextual Rules
• Windowing
• Patterns
• Analytics
• Machine Learning
• …
Stream
Analytics
Index / SearchNormalization
Applying an Analytic Model
is just a piece of the puzzle!

Frameworks
and Tooling?

Frameworks and Products
OPEN SOURCE CLOSED SOURCE
PRODUCT
FRAMEWORK
Azure Microsoft
Stream Analytics

When to use Kafka Streams for Stream Processing?

When to use Kafka Streams for Stream Processing?
No need for a
Big Data cluster
Deploy in your
existing infrastructure
Kafka manages
scalability / fail-over
Focus on development
of business logic
in your department

Kafka Streams
Map, filter, aggregate,
apply analytic model,
„any business logic“
Input Stream
(Kafka Topic)
Kafka Cluster
Output Stream
(Kafka Topic)
Kafka Cluster
Stream Processing
Microservice
(Kafka Streams)
Deployed anywhere:
Docker, Kubernetes,
Mesos, Java App, …

A complete streaming microservices, ready for production at large-scale
Word
Count
App configuration
Define processing
(here: WordCount)
Start processing

Confluent Platform: the Free, Open-Source Streaming Platform
Open Source ExternalCommercial
Confluent Platform
Monitoring
Analytics
Custom Apps
Transformations
Real-time
Applications
…
CRM
Data Warehouse
Database
Hadoop
Data
Integration
…
Control Center
Auto-data
Balancing
Multi-Data
Center Replication
24/7 Support
Supported
Connectors
Clients
Schema
Registry
REST
Proxy
Apache Kafka
Kafka
Connect
Kafka
Streams
Kafka
Core
Database Changes Log Events loT Data Web Events …

Streaming Platform
Big Data Analytics
Database
IoT Device
Streaming
Producer
…..
DWH
Data
Integration
C
O
N
N
E
C
T
C
O
N
N
E
C
T
Data Lake
Model
Building
Batch
Real
Time
Stream
Processing
REST
Interface
IoT Device
Mobile App
Streaming
Consumer
C
O
N
N
E
C
T
C
O
N
N
E
C
T
BI Tool
Messaging
Web
Application
Model
Schema Registry
/ Governance
1) Data Producer
4) Data Consumer

STREAMING PLATFORM
BIG DATAANALYTICS
Oracle DB
CoaP IoT
Kafka
Java Client
…..
HP Vertica
Data
Integration
F
L
U
M
E
H2O.ai,
Spark,
TensorFlow
Batch
Real
Time
Confluent
REST Proxy
MQTT IoT
iPhone App
Kafka
Go Client
C
K O
A N
F N
K E
A C
T
H
I
V
E
Grafana
Kafka
Java EE
Web App
Hadoop
C
K O
A N
F N
K E
A C
T
Confluent
Schema Registry
Kafka Streams
H2O.ai
Mesos
Kafka Streams
TensorFlow
Kubernetes
Avro
Avro
1) Data Producer
4) Data Consumer

Live Demos with Open Source Technologies
Development of Analytic Models
with Apache Kafka Messaging, Kafka Streams, Kafka Connect, Confluent Schema Registry

Live Demo
Use Case:
Airline Flight Delay Prediction
Any! (in our example, H2O.ai GBM)
Streaming Platform:
Apache Kafka Core, Kafka Connect,
Kafka Streams, Confluent Schema Registry

H2O.ai Model + Kafka Streams
Filter
Map
1) Create H2O ML model
2) Configure Kafka Streams Application
3) Apply H2O ML model to Streaming Data
4) Start Kafka Streams App

End-to-End Stream Monitoring and Alerting
Confluent Control Center
Data Stream Monitoring and Alerting
Multi-cluster monitoring and management
Kafka Connect Configuration
• Message delivery?
• Delays?
• Where got it stuck?
• Lost messages?
• Broker issues?
• Performance?
http://docs.confluent.io/3.2.0/control-center/docs/monitoring.html

Agenda

Let’s improve
the analytic model
continuously…

Analytical Pipeline
1. Data Access
2. Data Preparation
4. Model Building
5. Model Execution
6. Model Validation
7. Deployment
Online
Training
Continuously train and improve the model with every new event

Online Model Training of Analytic Models
How to improve models?
1.Manual Update
2.Automated Batch
3.Real Time

STREAMING PLATFORM
BIG DATAANALYTICS
F
L
U
M
E
H2O.ai,
Spark,
TensorFlow
H
I
V
E
Kafka
Hadoop
Confluent
Schema Registry
Kafka Streams
H2O.ai
Mesos
Kafka Streams
TensorFlow
Kubernetes
Avro
Avro
1) Get new Input Event
via Kafka Topic
2) Improve Model in
Big Data Cluster
3) Update deployed Model
via Kafka Topic
4) Leverage
Improved Model
for new Events

Caveats for Online Model Training
• Processes and infrastructure not ready
• Validation needed before production
• Slows down the system
• Only a few ML implementations supported
• Many use cases do not need it

Key Take-Aways
Ø Insights are hidden in Historical Data on Big Data Platforms
Ø Machine Learning and Big Data Analytics find these Insights by building Analytics Models
Ø Streaming Platform uses these Models (without Redevelopment) to take Action in Real Time

Kai Waehner
Technology Evangelist
kontakt@kai-waehner.de
@KaiWaehner
www.kai-waehner.de
LinkedIn
Questions? Feedback?
Please contact me!

Apache Kafka Streams + Machine Learning / Deep Learning

More Related Content

What's hot

Viewers also liked

Similar to Apache Kafka Streams + Machine Learning / Deep Learning

More from Kai Wähner

Recently uploaded

Apache Kafka Streams + Machine Learning / Deep Learning