Apache Kafka is a high-throughput distributed streaming platform that hundreds of companies have adopted to manage their real-time data. KSQL is an open source streaming SQL engine that runs continuous, interactive queries against Apache Kafka™. KSQL makes it easy to read, write, and process streaming data in real time, at scale, using SQL-like semantics. In my talk, I will discuss streaming ETL from Kafka into stores like Apache Cassandra using KSQL.
1. Streaming ETL in Kafka for Everyone with KSQL
Hojjat Jafarpour
Software Engineer, Confluent Inc.
2. Hojjat Jafarpour
Software Engineer at Confluent
○ Started the KSQL project at Confluent
Previously at Tidemark, Quantcast, Informatica and NEC Labs
PhD in Computer Science from UC Irvine
○ Data management, pub/sub and streaming
hojjat@confluent.io
@hojjat
3. Streaming ETL, with Apache Kafka and Confluent Platform
8. Kafka Connect: Stream data in and out of Kafka
Amazon S3
9. Single Message Transform (SMT)
▪ Modify events before storing in Kafka:
o Mask/drop sensitive information
o Set partitioning key
o Store lineage
▪ Modify events going out of Kafka:
o Route high priority events to faster data stores
o Direct events to different Elasticsearch indexes
o Cast data types to match destination
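As a rough illustration of what a chain of single-message transforms does, here is a plain-Python sketch. It is not the Kafka Connect API: real SMTs are Java classes configured on a connector. The event layout and the helper names (`mask_field`, `set_key`, `cast_field`, `apply_chain`) are assumptions made up for this example.

```python
import copy

def mask_field(event, field, mask="****"):
    """Return a copy of the event with one sensitive value field masked."""
    out = copy.deepcopy(event)
    if field in out["value"]:
        out["value"][field] = mask
    return out

def set_key(event, field):
    """Return a copy whose key is taken from a value field (the partitioning key)."""
    out = copy.deepcopy(event)
    out["key"] = out["value"][field]
    return out

def cast_field(event, field, to_type):
    """Return a copy with one value field cast to the type the destination expects."""
    out = copy.deepcopy(event)
    out["value"][field] = to_type(out["value"][field])
    return out

def apply_chain(event, transforms):
    """Apply each transform in order, one message at a time."""
    for transform in transforms:
        event = transform(event)
    return event

# Hypothetical event; field names are illustrative only.
event = {"key": None,
         "value": {"card_number": "4111-1111", "user": "u1", "amount": "42"}}
chain = [
    lambda e: mask_field(e, "card_number"),
    lambda e: set_key(e, "user"),
    lambda e: cast_field(e, "amount", int),
]
result = apply_chain(event, chain)
print(result)
```

Each transform sees exactly one message and returns one message, which is why joins and aggregations need a stream processor.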
10. But I need to join … aggregate … filter …
11. KSQL from Confluent
A Developer Preview of KSQL
An Open Source Streaming SQL Engine for Apache Kafka™
12. KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
▪ Enables stream processing with zero coding required
▪ The simplest way to process streams of data in real-time
▪ Powered by Kafka: scalable, distributed, battle-tested
▪ All you need is Kafka: no complex deployments of bespoke systems for stream processing
13. KSQL: the Simplest Way to Do Stream Processing
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
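To make the semantics of the fraud query concrete, the following plain-Python sketch counts attempts per card in 5-second tumbling windows and keeps only the counts above 3. It is a batch model over an in-memory list, not how KSQL executes the query. The function name, the `(timestamp_ms, card_number)` tuple layout, and the sample data are assumptions for illustration.

```python
from collections import Counter

WINDOW_MS = 5_000  # WINDOW TUMBLING (SIZE 5 SECONDS)

def possible_fraud(attempts, window_ms=WINDOW_MS, threshold=3):
    """attempts: iterable of (timestamp_ms, card_number) pairs.

    Returns {(window_start_ms, card_number): count} for every card whose
    count within a single window exceeds the threshold (the HAVING clause).
    """
    counts = Counter()
    for ts, card in attempts:
        window_start = ts - ts % window_ms  # align to the window boundary
        counts[(window_start, card)] += 1   # GROUP BY card_number, per window
    return {key: n for key, n in counts.items() if n > threshold}

# Hypothetical authorization attempts.
attempts = [(0, "A"), (1000, "A"), (2000, "A"), (4999, "A"),
            (5000, "A"), (100, "B"), (6000, "A")]
flagged = possible_fraud(attempts)
print(flagged)
```

Card "A" has four attempts in the window starting at 0 ms, so only that window is flagged. Its two attempts in the next window stay below the threshold.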
14. KSQL Concepts
▪ STREAM and TABLE as first-class citizens
o Interpretations of topic content
▪ STREAM - data in motion
▪ TABLE - collected state of a stream
o One record per key (per window)
o Current values (compacted topic) ← Not yet in KSQL
▪ STREAM – TABLE Joins
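One way to picture the stream/table duality is to fold a sequence of keyed records into per-key state. The sketch below is a plain-Python model under that assumption. The helper names and the sample records are hypothetical, and it says nothing about how KSQL stores its tables.

```python
from collections import Counter

def latest_by_key(stream):
    """Fold a stream of (key, value) records into a table of current values."""
    table = {}
    for key, value in stream:
        table[key] = value  # a later record for the same key overwrites the earlier one
    return table

def count_by_key(stream):
    """Fold the same stream into an aggregate table: one record per key."""
    return dict(Counter(key for key, _ in stream))

# Hypothetical stream of (userid, regionid) updates: data in motion.
user_updates = [("u1", "r1"), ("u2", "r2"), ("u1", "r3")]

current_regions = latest_by_key(user_updates)  # table: collected state
update_counts = count_by_key(user_updates)     # table: aggregate per key
print(current_regions, update_counts)
```

The stream keeps all three records. Each table keeps one row per key.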
15. Window Aggregations
Three types supported (same as KStreams):
● TUMBLING: Fixed-size, non-overlapping, gap-less windows
• SELECT ip, count(*) AS hits FROM clickstream
  WINDOW TUMBLING (size 1 minute) GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
• SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream
  WINDOW HOPPING (size 20 second, advance by 5 second) GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven windows
• SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream
  WINDOW SESSION (20 second) GROUP BY ip;
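The three window types differ only in how a record's timestamp maps to windows. The following plain-Python sketch shows that mapping. It is a simplified model, not the Kafka Streams implementation. The function names are hypothetical, timestamps are plain integers in seconds, and treating a gap exactly equal to the limit as part of the same session is an assumption.

```python
def tumbling_window(ts, size):
    """The single [start, end) window that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, advance):
    """Every [start, end) window containing ts; starts are multiples of advance."""
    windows = []
    start = ts - ts % advance            # latest window start that is <= ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts         # extend the current session
        else:
            sessions.append([ts, ts])    # start a new session
    return [tuple(s) for s in sessions]

print(tumbling_window(65, 60))               # size 1 minute
print(hopping_windows(12, 20, 5))            # size 20 second, advance by 5 second
print(session_windows([0, 5, 10, 40, 45], 20))
```

A record falls into exactly one tumbling window, and into up to size/advance hopping windows. Session boundaries depend on the data itself.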
16. Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
17. Simple Web Analytics Pipeline
● Pageview stream
● User table
● Materialized views
o Region visitor count
o Region visitor demography
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (kafka_topic='pageviews', value_format='JSON');
CREATE TABLE users (registertime BIGINT, gender VARCHAR, regionid VARCHAR, userid VARCHAR)
  WITH (kafka_topic='users', value_format='JSON');
18. Simple Web Analytics Pipeline
Region visitor count
CREATE STREAM joined_pageviews AS
SELECT users.userid AS userid, pageid, regionid, gender
FROM pageviews LEFT JOIN users ON pageviews.userid = users.userid;
CREATE TABLE region_visitor_count AS
SELECT regionid, COUNT(*) AS visit_count
FROM joined_pageviews
WINDOW TUMBLING (size 30 second)
GROUP BY regionid;
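The two statements above can be modelled in plain Python to show what each step produces. This is a simplified batch sketch, not KSQL's execution model. The users table is a static dict, although a real stream-table join looks up the table as each event arrives. `viewtime` stands in for the record timestamp, and rows without a region are skipped when counting. All three are assumptions of this example, as are the function names and the sample data.

```python
from collections import Counter

def left_join(pageviews, users):
    """Enrich each pageview with user columns; unmatched views get None."""
    joined = []
    for view in pageviews:
        user = users.get(view["userid"], {})
        joined.append({
            "viewtime": view["viewtime"],
            "userid": user.get("userid"),   # users.userid AS userid
            "pageid": view["pageid"],
            "regionid": user.get("regionid"),
            "gender": user.get("gender"),
        })
    return joined

def region_visitor_count(joined, window_ms=30_000):
    """COUNT(*) per region in 30-second tumbling windows."""
    counts = Counter()
    for row in joined:
        if row["regionid"] is None:         # assumption: skip rows with no region
            continue
        start = row["viewtime"] - row["viewtime"] % window_ms
        counts[(start, row["regionid"])] += 1
    return dict(counts)

# Hypothetical sample data.
users = {
    "u1": {"userid": "u1", "regionid": "r1", "gender": "FEMALE"},
    "u2": {"userid": "u2", "regionid": "r1", "gender": "MALE"},
}
pageviews = [
    {"viewtime": 1_000, "userid": "u1", "pageid": "p1"},
    {"viewtime": 2_000, "userid": "u2", "pageid": "p2"},
    {"viewtime": 31_000, "userid": "u1", "pageid": "p3"},
    {"viewtime": 32_000, "userid": "u9", "pageid": "p4"},  # no matching user
]
joined = left_join(pageviews, users)
counts = region_visitor_count(joined)
print(counts)
```

The left join keeps the pageview from the unknown user "u9" with empty user columns. The aggregate then yields one row per region per window.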
19. Simple Web Analytics Pipeline
Region visitor demography
CREATE TABLE region_visitor_demo_count AS
SELECT regionid, gender, COUNT(*) AS visit_count
FROM joined_pageviews
WINDOW TUMBLING (size 30 second)
GROUP BY gender, regionid;
20. Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
21. Confluent Platform: Enterprise Streaming based on Apache Kafka™
[Diagram: Confluent Platform]
Sources: Database Changes, Log Events, IoT Data, Web Events, …
Data Integration: CRM, Data Warehouse, Database, Hadoop, …
Real-time Applications: Monitoring, Analytics, Custom Apps, Transformations, …
Legend: Apache Open Source | Confluent Open Source | Confluent Enterprise
Confluent Platform components:
▪ Apache Kafka™: Core | Connect API | Streams API
▪ Data Compatibility: Schema Registry
▪ Monitoring & Administration: Confluent Control Center | Security
▪ Operations: Replicator | Auto Data Balancing
▪ Development and Connectivity: Clients | Connectors | REST Proxy | KSQL | CLI
22. Date to remember
• Kafka Summit 2018
• April 23-24 in London!
• More details: https://kafka-summit.org/
23. THANK YOU
hojjat@confluent.io
@hojjat
Please stay in touch
Any questions?
https://github.com/confluentinc/ksql/
https://www.confluent.io/download/