PRESENTATION TITLE ON ONE LINE
AND ON TWO LINES
First and last name
Position, company
Streaming ETL in Kafka for Everyone with KSQL
Hojjat Jafarpour
Software Engineer, Confluent Inc.
Hojjat Jafarpour
Software Engineer at Confluent
○ Started the KSQL project at Confluent
Previously at Tidemark, Quantcast, Informatica and NEC Labs
PhD in Computer Science from UC Irvine
○ Data management, pub/sub and streaming
hojjat@confluent.io
@hojjat
Streaming ETL, with Apache Kafka and Confluent Platform
Kafka Connect: Stream data in and out of Kafka
[Diagram; visible label: Amazon S3]
Single Message Transform (SMT)
▪ Modify events before storing in Kafka:
o Mask/drop sensitive information
o Set partitioning key
o Store lineage
▪ Modify events going out of Kafka:
o Route high-priority events to faster data stores
o Direct events to different Elasticsearch indexes
o Cast data types to match destination
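The per-record nature of these transforms can be modeled in a few lines. The sketch below is not the Kafka Connect API (real SMTs are configured on a connector); it is a plain-Python model in which `Record`, `mask_field`, `key_from_field`, `route_by_priority` and the field names are all hypothetical, chosen only to mirror three of the bullets above.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a single event moving through a connector."""
    topic: str
    key: Optional[str]
    value: dict


Transform = Callable[[Record], Record]


def mask_field(field: str) -> Transform:
    """Replace a sensitive field's value with a fixed placeholder."""
    def apply(rec: Record) -> Record:
        if field not in rec.value:
            return rec
        return replace(rec, value={**rec.value, field: "****"})
    return apply


def key_from_field(field: str) -> Transform:
    """Use one value field as the record key (the partitioning key)."""
    def apply(rec: Record) -> Record:
        return replace(rec, key=str(rec.value[field]))
    return apply


def route_by_priority(field: str, suffix: str) -> Transform:
    """Send records marked "high" to a differently named destination."""
    def apply(rec: Record) -> Record:
        if rec.value.get(field) == "high":
            return replace(rec, topic=rec.topic + suffix)
        return rec
    return apply


def run_chain(rec: Record, chain: List[Transform]) -> Record:
    """Apply each transform in order; each one sees exactly one record."""
    for transform in chain:
        rec = transform(rec)
    return rec


chain = [
    mask_field("card_number"),
    key_from_field("user_id"),
    route_by_priority("priority", "-urgent"),
]
event = Record(
    topic="payments",
    key=None,
    value={"user_id": 42, "card_number": "4111111111111111", "priority": "high"},
)
result = run_chain(event, chain)
print(result)
```

Each function looks at one record in isolation, which is why this model cannot express joins or aggregations across records.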
But I need to join… aggregate… filter…
KSQL from Confluent
A Developer Preview of KSQL
An Open Source Streaming SQL Engine for Apache Kafka™
KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
▪ Enables stream processing with zero coding required
▪ The simplest way to process streams of data in real time
▪ Powered by Kafka: scalable, distributed, battle-tested
▪ All you need is Kafka: no complex deployments of bespoke systems for stream processing
Ksql>
KSQL: the Simplest Way to Do Stream Processing
-- Count authorization attempts per card in 5-second windows; the aggregate result is a TABLE
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
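To make the semantics of that query concrete, here is a minimal batch model in Python. It assumes each attempt is a hypothetical `(timestamp_ms, card_number)` pair and that windows are aligned to timestamp 0. KSQL evaluates the query continuously over a topic, whereas this sketch processes a finite list once.

```python
from collections import Counter

WINDOW_MS = 5_000   # WINDOW TUMBLING (SIZE 5 SECONDS)
THRESHOLD = 3       # HAVING count(*) > 3


def possible_fraud(attempts, window_ms=WINDOW_MS, threshold=THRESHOLD):
    """Count attempts per (card, window) and keep counts above the threshold.

    attempts: iterable of (timestamp_ms, card_number) pairs.
    Returns {(card_number, window_start_ms): count}.
    """
    counts = Counter()
    for ts, card in attempts:
        window_start = ts - ts % window_ms   # start of the window containing ts
        counts[(card, window_start)] += 1    # GROUP BY card_number, per window
    return {key: n for key, n in counts.items() if n > threshold}


attempts = [
    (1_000, "A"), (1_500, "A"), (2_000, "A"), (4_900, "A"),  # 4 in window [0, 5000)
    (5_100, "A"),                                            # 1 in window [5000, 10000)
    (1_200, "B"), (3_000, "B"),                              # 2 in window [0, 5000)
]
flagged = possible_fraud(attempts)
print(flagged)  # {('A', 0): 4}
```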
KSQL Concepts
▪ STREAM and TABLE as first-class citizens
o Interpretations of topic content
▪ STREAM - data in motion
▪ TABLE - collected state of a stream
o One record per key (per window)
o Current values (compacted topic) ← Not yet in KSQL
▪ STREAM – TABLE Joins
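The two interpretations of the same topic content can be sketched in Python. This is an illustration only: the event list and function names are hypothetical, and the sketch shows both flavors of table mentioned above (an aggregate with one record per key, and the latest value per key) without implying which of them a given KSQL version supports.

```python
def latest_value_table(stream):
    """Fold a stream into a table holding the most recent value for each key."""
    table = {}
    for key, value in stream:
        table[key] = value      # a later event for the same key overwrites the earlier one
    return table


def count_table(stream):
    """Fold a stream into a table holding one aggregate record per key."""
    table = {}
    for key, _ in stream:
        table[key] = table.get(key, 0) + 1
    return table


# The stream is the full, ordered sequence of events ("data in motion").
events = [("alice", "Paris"), ("bob", "Lima"), ("alice", "Rome")]

print(latest_value_table(events))  # {'alice': 'Rome', 'bob': 'Lima'}
print(count_table(events))         # {'alice': 2, 'bob': 1}
```

The stream keeps all three events, while either table keeps a single row per key that changes as new events arrive.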
Window Aggregations
Three types supported (same as KStreams):
● TUMBLING: Fixed-size, non-overlapping, gap-less windows
• SELECT ip, count(*) AS hits FROM clickstream
WINDOW TUMBLING (size 1 minute) GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
• SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream
WINDOW HOPPING ( size 20 second, advance by 5 second) GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven window
• SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream
WINDOW SESSION (20 second) GROUP BY ip;
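How an event timestamp maps to windows differs across the three types. The Python sketch below models only that assignment step, under stated assumptions: timestamps and sizes are plain integers in the same unit, windows are half-open `[start, end)` and aligned to 0, and a gap exactly equal to the session timeout still counts as the same session. The function names are hypothetical.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts, size, advance):
    """Return every [start, end) window of length `size` that contains ts.

    Window starts are multiples of `advance`, so one event can belong to
    several overlapping windows.
    """
    windows = []
    start = ts - ts % advance          # latest window start not after ts
    while start >= 0 and start + size > ts:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts       # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    return [tuple(s) for s in sessions]


print(tumbling_window(65, 60))                  # (60, 120)
print(hopping_windows(12, 20, 5))               # [(0, 20), (5, 25), (10, 30)]
print(session_windows([0, 10, 25, 50, 60], 20)) # [(0, 25), (50, 60)]
```

A tumbling window is the special case of a hopping window whose advance equals its size, so each event lands in exactly one window.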
Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
Simple Web Analytics Pipeline
● Pageview stream
● User table
● Materialized views
o Region visitor count
o Region visitor demography
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (kafka_topic='pageviews', value_format='JSON');
CREATE TABLE users (registertime BIGINT, gender VARCHAR, regionid VARCHAR, userid VARCHAR)
  WITH (kafka_topic='users', value_format='JSON');
Simple Web Analytics Pipeline
Region visitor count
CREATE STREAM joined_pageviews AS
SELECT users.userid AS userid, pageid, regionid, gender
FROM pageviews LEFT JOIN users ON pageviews.userid = users.userid;
CREATE TABLE region_visitor_count AS
SELECT regionid , COUNT(*) AS visit_count
FROM joined_pageviews
WINDOW TUMBLING (size 30 second)
GROUP BY regionid;
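The two statements above form a small pipeline: enrich each pageview from the users table, then count per region per 30-second window. A rough Python model follows, with several simplifications that are assumptions of this sketch: the users table is a static dict rather than a continuously updated table, the pageview's `viewtime` is used as the event timestamp, the joined row keeps the pageview's `userid`, and all sample data is invented.

```python
from collections import Counter


def left_join(pageviews, users):
    """Enrich each pageview with regionid and gender from the users table.

    LEFT JOIN: a pageview with no matching user is kept, with None fields.
    """
    for view in pageviews:
        user = users.get(view["userid"], {})
        yield {
            "viewtime": view["viewtime"],
            "userid": view["userid"],
            "pageid": view["pageid"],
            "regionid": user.get("regionid"),
            "gender": user.get("gender"),
        }


def region_visitor_count(joined, size_ms=30_000):
    """Count joined pageviews per (regionid, window start) in tumbling windows."""
    counts = Counter()
    for row in joined:
        window_start = row["viewtime"] - row["viewtime"] % size_ms
        counts[(row["regionid"], window_start)] += 1
    return dict(counts)


users = {
    "u1": {"regionid": "r1", "gender": "F"},
    "u2": {"regionid": "r2", "gender": "M"},
}
pageviews = [
    {"viewtime": 1_000, "userid": "u1", "pageid": "p1"},
    {"viewtime": 2_000, "userid": "u2", "pageid": "p2"},
    {"viewtime": 20_000, "userid": "u1", "pageid": "p9"},
    {"viewtime": 31_000, "userid": "u1", "pageid": "p3"},
    {"viewtime": 32_000, "userid": "u3", "pageid": "p1"},  # no matching user
]

counts = region_visitor_count(left_join(pageviews, users))
print(counts)
```

Adding `gender` to the grouping key, as in `(row["regionid"], row["gender"], window_start)`, gives the per-region demography count instead.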
Simple Web Analytics Pipeline
Region visitor demography
CREATE TABLE region_visitor_demo_count AS
SELECT regionid, gender, COUNT(*) AS visit_count
FROM joined_pageviews
WINDOW TUMBLING (size 30 second)
GROUP BY gender, regionid;
Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
Confluent Platform: Enterprise Streaming based on Apache Kafka™
[Diagram]
Database Changes | Log Events | IoT Data | Web Events | …
Data Integration: CRM | Data Warehouse | Database | Hadoop | …
Real-time Applications: Monitoring | Analytics | Custom Apps | Transformations | …
Apache Open Source | Confluent Open Source | Confluent Enterprise
Confluent Platform
Apache Kafka™: Core | Connect API | Streams API
Data Compatibility: Schema Registry
Monitoring & Administration: Confluent Control Center | Security
Operations: Replicator | Auto Data Balancing
Development and Connectivity: Clients | Connectors | REST Proxy | KSQL | CLI
Date to remember
• Kafka Summit 2018
• April 23-24 in London!
• More details: https://kafka-summit.org/
THANK YOU
hojjat@confluent.io
@hojjat
Please stay in touch
Any questions?
https://github.com/confluentinc/ksql/
https://www.confluent.io/download/

Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
