Streaming Transformations - Putting the T in Streaming ETL
Nick Dearden, Director of Engineering
Nick, Director of Engineering, is one of the masterminds and
co-creators of KSQL. He is also a technology and product
leader at Confluent and brings with him many years of
experience in the world of data and analytic systems to help
design and explain the power of a streaming platform for every
business.
Nick Dearden
Director of Engineering, Confluent
Housekeeping Items
● This session will be one hour.
● Submit questions by entering them into the GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available.
Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
Single Message Transform (SMT) -- Extract, TRANSFORM, Load…
• Modify events before storing in Kafka:
• Mask/drop sensitive information
• Set partitioning key
• Store lineage
• Modify events going out of Kafka:
• Route high-priority events to faster data stores
• Direct events to different Elasticsearch indexes
• Cast data types to match destination
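A minimal sketch of how the first two "before storing" bullets might look as SMT configuration on a source connector, using the built-in MaskField and ValueToKey transforms (the field names ssn and customer_id are hypothetical):

transforms=mask,setKey
# Blank out a sensitive value field before it reaches Kafka (hypothetical field)
transforms.mask.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.mask.fields=ssn
# Promote a value field to the record key to control partitioning (hypothetical field)
transforms.setKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.setKey.fields=customer_id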
Single Message Transforms
http://kafka.apache.org/documentation.html#connect_transforms
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
(Diagram: a record passing through an SMT, which appends bespoke lineage data to the record before it is stored)
Built-in Transforms
https://docs.confluent.io/current/connect/transforms/index.html
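One example from the built-in set: TimestampRouter rewrites the target topic name per record, which a sink such as Elasticsearch can then use to pick time-based indexes. A minimal sketch, using the transform's documented topic.format and timestamp.format settings:

transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.TimestampRouter
# Rewrite the topic to <original>-<yyyyMMdd>, so downstream events land in a per-day index
transforms.route.topic.format=${topic}-${timestamp}
transforms.route.timestamp.format=yyyyMMdd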
Sink properties: Converters
• JSON, Avro, String, Protobuf, etc.
• Specify the converter in the Kafka Connect configuration, e.g.
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
• Kafka Connect uses pluggable converters for both message-key and message-value (de)serialization
Single Message Transforms plus Converters
• Modify events before storing in Kafka:
• Mask/drop sensitive information
• Set partitioning key
• Store lineage
• Cast data types
• Modify events going out of Kafka:
• Direct events to different targets
• Mask/drop sensitive information
• Cast data types to match destination
Complete Sink Definition
{
  "name": "es-sink-rental-lengths-02",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "schema.ignore": "true",
    "connection.url": "http://localhost:9200",
    "type.name": "kafka-connect",
    "topics": "RENTAL_LENGTHS",
    "topic.index.map": "RENTAL_LENGTHS:rental_lengths",
    "key.ignore": "true"
  }
}
More complex example: multiple transformations for different targets
(Diagram: a log source flows into KSQL, which filters, aggregates, and joins to derive multiple outputs: raw logs to HDFS / S3, error logs to Elasticsearch, and SLA breaches to an alerting app)
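A sketch of the kind of KSQL that could drive this fan-out; the stream and column names (raw_logs, level, latency_ms, service) are hypothetical:

-- Route only error events to a derived stream, e.g. for the Elasticsearch sink
CREATE STREAM error_logs AS
  SELECT * FROM raw_logs
  WHERE level = 'ERROR';

-- Count slow requests per service in one-minute windows, e.g. to feed the alerting app
CREATE TABLE sla_breaches AS
  SELECT service, COUNT(*) AS slow_count
  FROM raw_logs
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE latency_ms > 1000
  GROUP BY service;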
Aggregates and Windowing
• COUNT, SUM, MIN, MAX
• Windowing - Not strictly ANSI SQL ☺
• Three window types supported:
• TUMBLING
• HOPPING (aka ‘sliding’)
• SESSION
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name;
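For comparison, a sketch of the same aggregate over the other two window types (window sizes are arbitrary):

-- HOPPING: overlapping 30-minute windows, a new one starting every 5 minutes
SELECT uid, name, count(*) AS rating_count
FROM vip_poor_ratings
WINDOW HOPPING (SIZE 30 MINUTES, ADVANCE BY 5 MINUTES)
GROUP BY uid, name;

-- SESSION: windows bounded by 20 minutes of inactivity per key
SELECT uid, name, count(*) AS rating_count
FROM vip_poor_ratings
WINDOW SESSION (20 MINUTES)
GROUP BY uid, name;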
Joins for Enrichment
CREATE STREAM vip_poor_ratings AS
  SELECT uid, name, elite, stars, message
  FROM poor_ratings r
  LEFT JOIN users u ON r.user_id = u.uid
  WHERE u.elite = 'Platinum';
Enrich the 'poor_ratings' stream with data about each user, and derive a stream of low-quality ratings posted only by our Platinum Elite users.
Stream/Table Duality
A changelog STREAM and a TABLE are two views of the same data: replaying the stream's records in order rebuilds the table's state, one key at a time.

STREAM (changelog)   TABLE (state after each record)
("alice", 1)         alice 1
("charlie", 1)       alice 1, charlie 1
("alice", 2)         alice 2, charlie 1
("bob", 1)           alice 2, charlie 1, bob 1
Streams & Tables
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream
• One record per key (per window)
• Current values (compacted topic) ← Not yet in KSQL
• Changelog
● STREAM – TABLE Joins
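To make the "interpretations of topic content" point concrete, a sketch of registering one hypothetical topic as both a STREAM and a TABLE (topic and column names are made up; KSQL requires a KEY property for tables):

-- Each record is an independent event
CREATE STREAM user_events (uid VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC='user_events', VALUE_FORMAT='JSON');

-- Each record is an update; the table holds the latest value per key
CREATE TABLE user_state (uid VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC='user_events', VALUE_FORMAT='JSON', KEY='uid');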
Join Types in KSQL
• Stream -> Table (included in GA version)
  • enrichment & lookups
• Stream -> Stream (included in next preview version)
  • correlate streams of events in an overlapping time range, e.g. ad impressions and ad clicks
• Table -> Table (included in next preview version)
  • you probably already know this one ☺
• LEFT, RIGHT, INNER, OUTER
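A sketch of the stream-stream case, correlating the ad example above (stream and column names are hypothetical):

-- Match each click to an impression for the same ad within one hour
SELECT i.ad_id, i.impression_time, c.click_time
FROM impressions i
JOIN clicks c WITHIN 1 HOUR
ON i.ad_id = c.ad_id;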
Discovering Functions
DESCRIBE FUNCTION foo;
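Here foo is just a placeholder function name; to discover real names to describe, recent KSQL versions also provide a listing statement:

-- List all built-in and user-defined functions
SHOW FUNCTIONS;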
Sneak Peek! Function Source Code
Resources and Next Steps
https://confluent.io
https://confluent.io/ksql
https://slackpass.io/confluentcommunity
#ksql
@confluentinc
Questions?
Thank you for joining us!
