This document discusses using KSQL to process streaming data from Apache Kafka. It begins with an example use case of processing an ecommerce site event stream in real-time. It then provides an introduction to KSQL, describing its key concepts of streams and tables. The document outlines KSQL's features like aggregation, windowing, joins, and nested data support. It demonstrates defining streams and tables in KSQL, performing queries and transformations on streams, and joining streams with tables for enrichment. Finally, it suggests next steps of downloading Confluent Platform and exploring KSQL examples and demos.
2. Bio
Hojjat Jafarpour
● Software Engineer @ Confluent
○ Creator of KSQL
● Previously at NEC Labs, Informatica, Quantcast, and Tidemark
○ Worked on various big data projects
● Ph.D. in Computer Science from UC Irvine
○ Scalable stream processing and Publish/Subscribe systems
● @Hojjat
10. KSQL: the Streaming SQL Engine for Apache Kafka® from Confluent
● Enables stream processing with zero coding required
● The simplest way to process streams of data in real time
● Powered by Kafka: scalable, distributed, battle-tested
● All you need is Kafka - no complex deployments of bespoke systems for stream processing
11. What is it for?
Streaming ETL
● Kafka is popular for data pipelines.
● KSQL enables easy transformations of data within the pipeline
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
12. What is it for?
Anomaly Detection
● Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
13. What is it for?
Real Time Monitoring
● Log data monitoring, tracking and alerting
● Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
17. [Figure: stream-table duality. Reading the stream ("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1) as a changelog materializes a table: after each record the table holds the latest value per key, ending with alice = 2, charlie = 1, bob = 1. Conversely, every update to the table appends a record to the stream.]
18-20. Streams & Tables
● STREAM and TABLE as first class citizens
○ Interpretation of the topic content
● STREAM: data in motion
● TABLE: collected state of a stream
○ One record per key (per window)
○ Current values (compacted topic)
○ Changelog
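The queries on the following slides assume a 'ratings' stream and a 'users' table have already been declared over Kafka topics. A minimal sketch of what those declarations might look like - the column names are inferred from the later queries, while the topic names, key column, and JSON value format are assumptions, not from the talk:

```sql
-- Hypothetical declarations; topic names and VALUE_FORMAT are assumptions.
CREATE STREAM ratings
  (user_id INT, stars INT, route_id INT,
   rating_time BIGINT, channel VARCHAR, message VARCHAR)
  WITH (KAFKA_TOPIC='ratings', VALUE_FORMAT='JSON');

CREATE TABLE users
  (uid INT, name VARCHAR, elite VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='uid');
```

The STREAM or TABLE keyword chooses the interpretation of the topic content: the same bytes can be read as an unbounded sequence of events or as an evolving state per key.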
26-28. SELECTing from the Stream
Let's test our new stream definition by finding all the low-scoring ratings from our iPhone app:
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%'
LIMIT 10;
29-30. SELECTing from the Stream
And set this to run as a continuous transformation, with results being saved into a new topic:
CREATE STREAM poor_ratings AS
SELECT *
FROM ratings
WHERE stars <= 2
AND lcase(channel) LIKE '%ios%';
33. Joins for Enrichment
Enrich the 'poor_ratings' stream with data about each user, and derive a stream of low-quality ratings posted only by our Platinum Elite users
34. Joins for Enrichment
CREATE STREAM vip_poor_ratings AS
SELECT uid, name, elite,
stars, route_id, rating_time,
message
FROM poor_ratings r
LEFT JOIN users u ON r.user_id = u.uid
WHERE u.elite = 'P';
35. Aggregates and Windowing
● COUNT, SUM, MIN, MAX
● Windowing - Not strictly ANSI SQL
● Three window types supported:
○ TUMBLING
○ HOPPING (aka 'sliding')
○ SESSION
SELECT uid, name, count(*) AS rating_count
FROM vip_poor_ratings
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY uid, name;
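Only TUMBLING appears on the slide; the other two window types use the same WINDOW clause with different parameters. Hedged sketches over the same stream (the window sizes here are arbitrary illustrative choices, not from the talk):

```sql
-- HOPPING: overlapping 10-minute windows that advance every minute,
-- so each record falls into multiple windows
SELECT uid, count(*) AS rating_count
FROM vip_poor_ratings
WINDOW HOPPING (SIZE 10 MINUTES, ADVANCE BY 1 MINUTE)
GROUP BY uid;

-- SESSION: a per-key window that closes after 60 seconds of inactivity
SELECT uid, count(*) AS rating_count
FROM vip_poor_ratings
WINDOW SESSION (60 SECONDS)
GROUP BY uid;
```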
36-37. Continuous Aggregates
Save the results of our aggregation to a TABLE:
CREATE TABLE sad_vips AS
SELECT uid, name, count(*) AS rating_count
FROM vip_poor_ratings
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY uid, name
HAVING count(*) > 2;
38. Where to go from here?
Time to get involved!
● Download Confluent Platform
● Step through the QuickStart
● Play with the examples and demos
http://confluent.io/ksql
https://github.com/confluentinc/ksql
https://slackpass.io/confluentcommunity #ksql
Editor's Notes
A quick intro to KSQL in case they missed Niel’s talk.
STREAM and TABLE are both first-class citizens in KSQL
Both of these are interpretations of topic content. Topics are what actually exist in Kafka; streams and tables are KSQL abstractions over them.
STREAM - data in motion. An unbounded sequence of facts (aka events, messages).
TABLE - collected state of a stream. An evolving collection of facts.
One record per key (per window)
Changelog