KSQL – The Open Source SQL Streaming Engine for Apache Kafka (Big Data Spain 2018)

KSQL
The Open Source Streaming SQL Engine for Apache Kafka
Kai Waehner
Technology Evangelist
kontakt@kai-waehner.de
LinkedIn
@KaiWaehner
www.confluent.io
www.kai-waehner.de

A Brief History of Apache Kafka and Confluent
Timeline, 2012 to 2018:
0.7 Cluster mirroring
0.8 Intra-cluster replication
0.9 Data integration (Connect API)
0.10 Data processing (Streams API)
0.11 Exactly-once semantics
1.0 Enterprise Ready :-)
CP 4.1 KSQL GA

KSQL – The Streaming SQL Engine for Apache Kafka

KSQL - Streaming SQL for Apache Kafka
Why KSQL?
[Figure: audiences plotted by population and coding sophistication. Labels: Realm of Stream Processing (Kafka Streams); New, Expanded Realm (KSQL). Audiences: BI analysts, data engineers, core developers, core developers who don't like Java.]

Shoulders of Streaming Giants
Consumer, Producer: subscribe(), poll(), send(), flush(), beginTransaction(), …
Kafka Streams: KStream, KTable, filter(), map(), flatMap(), join(), aggregate(), transform(), …
KSQL: CREATE STREAM, CREATE TABLE, SELECT, JOIN, GROUP BY, SUM, …
KSQL UDFs

KSQL for Data Exploration and Debugging
An easy way to inspect your data in Kafka
SHOW TOPICS;
SELECT page, user_id, status, bytes
FROM clickstream
WHERE user_agent LIKE 'Mozilla/5.0%';
PRINT 'my-topic' FROM BEGINNING;

KSQL for Data Transformation
Quickly make derivations of existing data in Kafka
CREATE STREAM clicks_by_user_id
WITH (PARTITIONS=6,
TIMESTAMP='view_time',
VALUE_FORMAT='JSON') AS
SELECT * FROM clickstream
PARTITION BY user_id;
1. Change number of partitions
2. Convert data to JSON
3. Repartition the data
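To make the effect of PARTITION BY more concrete, here is a minimal plain-Python sketch of key-based repartitioning. It is an illustration only, not how KSQL or Kafka implement it: the function names are hypothetical, and `zlib.crc32` stands in for Kafka's own key hashing, so the partition numbers will not match a real cluster.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition(records: List[dict], key_field: str, num_partitions: int) -> Dict[int, List[dict]]:
    """Group records by the partition that their new key hashes to."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        partition = partition_for(str(record[key_field]), num_partitions)
        partitions[partition].append(record)
    return dict(partitions)


# Hypothetical clickstream records, re-keyed by user_id into 6 partitions.
clicks = [
    {"user_id": "u1", "page": "/a"},
    {"user_id": "u2", "page": "/b"},
    {"user_id": "u1", "page": "/c"},
]
by_user = repartition(clicks, "user_id", 6)
```

The point of the sketch is the invariant: after repartitioning, all records with the same user_id sit in the same partition, which is what later per-user joins and aggregations rely on.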

KSQL for Real-Time, Streaming ETL
Filter, cleanse, process data while it is in motion
CREATE STREAM clicks_from_vip_users AS
SELECT c.user_id, u.country, page, action
FROM clickstream c
LEFT JOIN users u ON c.user_id = u.user_id
WHERE u.level = 'Platinum';
1. Pick only VIP users

Example: CDC from DB via Kafka to Elastic

KSQL for Real-time Data Enrichment
Join data from a variety of sources to see the full picture
CREATE STREAM enriched_payments AS
SELECT payment_id, c.country, total
FROM payments_stream p
LEFT JOIN customers_table c
ON p.user_id = c.user_id;
1. Stream-Table Join
2. Stream-Stream Join
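The stream-table join in the query above can be pictured as a lookup per event. The following plain-Python sketch shows the LEFT JOIN semantics only; the function name and sample data are hypothetical, and the table is a static dict, whereas a real KSQL table changes over time as new records arrive.

```python
from typing import Dict, Iterable, Iterator


def enrich_payments(payments: Iterable[dict], customers: Dict[str, dict]) -> Iterator[dict]:
    """Left-join each payment event against the customer table by user_id."""
    for payment in payments:
        customer = customers.get(payment["user_id"])
        yield {
            "payment_id": payment["payment_id"],
            # LEFT JOIN: keep the payment even if no customer row matches.
            "country": customer["country"] if customer else None,
            "total": payment["total"],
        }


# Hypothetical sample data.
customers_table = {"u1": {"country": "DE"}}
payments_stream = [
    {"payment_id": 1, "user_id": "u1", "total": 9.99},
    {"payment_id": 2, "user_id": "u9", "total": 5.00},
]
enriched = list(enrich_payments(payments_stream, customers_table))
```

The second payment has no matching customer, so it is emitted with a country of None rather than dropped.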

Example: Retail

KSQL for Real-Time Monitoring
Derive insights from events (IoT, sensors, etc.) and turn them into actions
CREATE TABLE failing_vehicles AS
SELECT vehicle, COUNT(*)
FROM vehicle_monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE event_type = 'ERROR'
GROUP BY vehicle
HAVING COUNT(*) >= 5;
1. Now we know to alert, and whom

Example: IoT, Automotive, Connected Cars

KSQL for Anomaly Detection
Aggregate data to identify patterns and anomalies in real-time
CREATE TABLE possible_fraud AS
SELECT card_number, COUNT(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 30 SECONDS)
GROUP BY card_number
HAVING COUNT(*) > 3;
1. Aggregate data
2. … per 30-sec windows
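The aggregation above can be sketched in plain Python as a count per card and 30-second window, followed by the HAVING filter. This is a batch approximation under stated assumptions: timestamps are epoch milliseconds, the function name and sample data are hypothetical, and KSQL would instead update the result continuously as events arrive.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def possible_fraud(
    attempts: Iterable[Tuple[str, int]],
    window_ms: int = 30_000,
    threshold: int = 3,
) -> Dict[Tuple[str, int], int]:
    """Count attempts per (card_number, window start); keep counts above threshold."""
    counts: Counter = Counter()
    for card_number, ts_ms in attempts:
        window_start = ts_ms - ts_ms % window_ms  # tumbling window the event falls into
        counts[(card_number, window_start)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical authorization attempts: (card_number, timestamp in ms).
attempts = [
    ("A", 1_000), ("A", 5_000), ("A", 12_000), ("A", 29_000),
    ("B", 2_000), ("B", 3_000), ("B", 4_000),
    ("A", 31_000),
]
flagged = possible_fraud(attempts)  # {("A", 0): 4}
```

Card B has exactly three attempts in the window, so it is not flagged; the query uses a strict `> 3`.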

Example: Anomaly Detection with Deep Learning (Autoencoder)
CREATE STREAM AnomalyDetection AS
SELECT sensor_id, detectAnomaly(sensor_values)
FROM car_engine;
User Defined Function (UDF)
https://github.com/kaiwaehner/ksql-udf-deep-learning-mqtt-iot

Independent Dev / Test / Prod of different Apps and Microservices

No Matter Where it Runs

KSQL Concepts
● No need for source code
• Zero, none at all, not even one line.
• No SerDes, no generics, no lambdas, ...
● All the Kafka and Kafka Streams “magic” out-of-the-box
• Exactly Once Semantics
• Windowing
• Event-time aggregation
• Late-arriving data
• Distributed, fault-tolerant, scalable, ...

KSQL is equally viable for S / M / L / XL / XXL use cases
Ok. Ok. Ok.
… and KSQL is ready for production, including 24/7 support!

Fault-Tolerance, powered by Kafka

STREAM and TABLE as first-class citizens

WINDOWing
● Not ANSI SQL! → Continuous Queries
• TUMBLING
  SELECT appname, ip, COUNT(appname) AS problem_count
  FROM logstream WINDOW TUMBLING (size 1 minute)
  WHERE loglevel='ERROR'
  GROUP BY appname, ip;
• HOPPING
  SELECT itemid, SUM(arraycol[0])
  FROM orders WINDOW HOPPING (size 20 second, advance by 5 second)
  GROUP BY itemid;
• SESSION
  SELECT itemid, SUM(sales_price)
  FROM orders WINDOW SESSION (20 second)
  GROUP BY itemid;
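The three window types differ in how an event timestamp maps to windows. The sketch below illustrates that mapping in plain Python; it is not KSQL code. Assumptions: timestamps are non-negative epoch milliseconds, windows are aligned to time 0, the function names are hypothetical, and the exact boundary handling of the session gap is a simplification.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # (start_ms, end_ms)


def tumbling_window(ts_ms: int, size_ms: int) -> Window:
    """Fixed-size, non-overlapping: each event belongs to exactly one window."""
    start = ts_ms - ts_ms % size_ms
    return (start, start + size_ms)


def hopping_windows(ts_ms: int, size_ms: int, advance_ms: int) -> List[Window]:
    """Fixed-size, overlapping: an event can belong to several windows."""
    first = max(0, ts_ms - size_ms + advance_ms) // advance_ms * advance_ms
    return [(s, s + size_ms) for s in range(first, ts_ms + 1, advance_ms)]


def session_windows(timestamps_ms: Iterable[int], gap_ms: int) -> List[Window]:
    """Data-driven: a session grows until no event arrives within the gap.

    Here a window is (first event, last event) of the session.
    """
    sessions: List[Window] = []
    for ts in sorted(timestamps_ms):
        if sessions and ts - sessions[-1][1] <= gap_ms:
            sessions[-1] = (sessions[-1][0], ts)  # extend the current session
        else:
            sessions.append((ts, ts))  # start a new session
    return sessions


one_minute = tumbling_window(61_500, 60_000)
overlapping = hopping_windows(12_000, 20_000, 5_000)
sessions = session_windows([0, 5_000, 12_000, 50_000, 60_000], 20_000)
```

With a 20-second size and a 5-second advance, an event usually falls into four hopping windows; the event at 12 seconds falls into three only because windows cannot start before time 0.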

KSQL - Components
KSQL has 3 main components:
1. The Engine, which actually runs the Kafka Streams topologies
2. The REST server interface, which enables an Engine to receive instructions from the CLI or any other client
3. The CLI, designed to be familiar to users of MySQL, Postgres, etc.
(Note that you also need a Kafka Cluster… KSQL is deployed independently)

KSQL can be used interactively + programmatically
1. UI
2. CLI (ksql>)
3. REST (POST /query)
4. Headless
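As a rough sketch of the programmatic REST option, the snippet below builds (but does not send) a `POST /query` request with the Python standard library. The `/query` path comes from the slide; the server address, the `ksql` and `streamsProperties` body fields, and the content type are assumptions recalled from the KSQL REST API documentation and should be checked against the server version in use.

```python
import json
from typing import Optional
from urllib.request import Request


def build_query_request(
    server_url: str,
    statement: str,
    streams_properties: Optional[dict] = None,
) -> Request:
    """Build a POST /query request for a KSQL server (body shape is an assumption)."""
    body = {
        "ksql": statement,
        "streamsProperties": streams_properties or {},
    }
    return Request(
        url=server_url.rstrip("/") + "/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )


# Hypothetical local server address; nothing is sent over the network here.
request = build_query_request(
    "http://localhost:8088",
    "SELECT page, user_id FROM clickstream;",
)
```

Passing the request to `urllib.request.urlopen` would send it; because a streaming query keeps returning rows, a client would typically read the response incrementally rather than wait for it to end.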

Architecture (Client – Server Mode)
[Diagram: the KSQL CLI or any REST client connects to three KSQL Servers, each in its own JVM, which run against the Kafka Cluster.]

Architecture (Headless Mode)
[Diagram: three KSQL Servers, each in its own JVM, run against the Kafka Cluster.]

Dedicating resources
Join Engines to the same ‘service pool’ by means of the ksql.service.id property

User Defined Functions (UDF, UDAF)
Write UDF code in Java, mark with annotations @UdfDescription, @Udf.
Make UDF available to KSQL (next slides), then use it like any other KSQL function in your queries:
SELECT address, STRINGLENGTH(address->street) FROM orders;
The UDF name in KSQL queries is whatever you define in the `name` field in the annotation (here: “stringLength”).

Live Demo
KSQL in Action

KSQL Quick Start – Getting Started in Minutes!
https://docs.confluent.io/current/quickstart/index.html
Local runtime or Docker container

Demo - Clickstream Analysis
• https://docs.confluent.io/current/ksql/docs/tutorials/clickstream-docker.html#ksql-clickstream-docker
• Leverages Apache Kafka, Kafka Connect, KSQL, Elasticsearch and Grafana
• 5min screencast: https://www.youtube.com/watch?v=A45uRzJiv7I
• Setup in 5 minutes (with or without Docker)
SELECT STREAM
  CEIL(timestamp TO HOUR) AS timeWindow, productId,
  COUNT(*) AS hourlyOrders, SUM(units) AS units
FROM Orders
GROUP BY CEIL(timestamp TO HOUR), productId;

timeWindow | productId | hourlyOrders | units
-----------+-----------+--------------+------
08:00:00   |        10 |            2 |     5
08:00:00   |        20 |            1 |     8
09:00:00   |        10 |            4 |    22
09:00:00   |        40 |            1 |    45
...        |       ... |          ... |   ...

KSQL Recipes
https://www.confluent.io/stream-processing-cookbook

Resources and Next Steps
Get Involved
• Try the Quickstart on GitHub
• Check out the code
• Play with the examples
KSQL is GA… You can already use it for production deployments!
https://github.com/confluentinc/ksql
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql

KSQL is the Streaming SQL Engine for Apache Kafka

Questions?
