Companies new and old are all recognizing the importance of a low-latency, scalable, fault-tolerant data backbone in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple systems and data sources to enable low-latency analytics, event-driven architectures, and the population of downstream systems. What's more, these data pipelines can be built using configuration alone.
In this talk, we'll see how easy it is to capture a stream of data changes in real time from a database such as MySQL into Kafka using the Kafka Connect framework, then use KSQL to filter, aggregate, and join it to other data, and finally stream the results from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of Java code!
2. Apache Kafka®
[Diagram: Kafka cluster]
A distributed commit log. Publish and subscribe to streams of records. Highly scalable, high throughput. Supports transactions. Persisted data.
• Reads are a single seek & scan
• Writes are append only
3. Apache Kafka®
• Kafka Streams API: write standard Java applications & microservices to process your data in real-time
• Kafka Connect API: reliable and scalable integration of Kafka with other systems – no coding required
[Diagram labels: Orders, Customers, Table, Kafka Streams API]
13. What does a streaming platform do?
• Publish and subscribe to streams of data, similar to a message queue or enterprise messaging system.
• Store streams of data in a durable, fault-tolerant way.
• Process streams of data in real time, as they occur.
23. Streaming Application Data to Kafka
• Applications are a rich source of events
• Modifying applications is not always possible or desirable
• And what if the data gets changed within the database or by other apps?
• JDBC is one option for extracting data
• Confluent Open Source includes JDBC source & sink connectors (a sample source configuration is sketched below)
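As a rough sketch (not taken from the slides), a JDBC source connector is just a handful of properties; the connection URL, credentials, table, and topic prefix below are illustrative:

# Minimal sketch of a JDBC source connector configuration (standalone
# properties format); all connection details and names are illustrative.
name=jdbc-source-mysql
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://localhost:3306/demo?user=connect&password=connect
# Poll only the orders table, detecting new rows via the incrementing id column
table.whitelist=orders
mode=incrementing
incrementing.column.name=id
# Rows land in a Kafka topic named mysql-orders
topic.prefix=mysql-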
24. Liberate Application Data into Kafka with CDC
• Relational databases use transaction logs to ensure Durability of data
• Change-Data-Capture (CDC) mines the log to get raw events from the database
• CDC tools that integrate with Kafka Connect include:
  • Debezium (a sample configuration is sketched below)
  • DBVisit
  • GoldenGate
  • Attunity
  • + more
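For reference, a Debezium MySQL connector is configured in much the same way; the hostnames, credentials, and topic names in this sketch are illustrative placeholders rather than anything shown in the talk:

# Minimal sketch of a Debezium MySQL source connector configuration;
# all values here are illustrative.
name=mysql-cdc-source
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=localhost
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=42
# Logical name used as the prefix for change-event topics
database.server.name=demo
database.whitelist=demo
# Debezium keeps the database schema history in its own Kafka topic
database.history.kafka.bootstrap.servers=localhost:9092
database.history.kafka.topic=dbhistory.demo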
28. Single Message Transform (SMT) -- Extract, TRANSFORM, Load…
• Modify events before storing in Kafka:
  • Mask/drop sensitive information
  • Set partitioning key
  • Store lineage
• Modify events going out of Kafka:
  • Route high-priority events to faster data stores
  • Direct events to different Elasticsearch indexes
  • Cast data types to match the destination
(a sample SMT configuration is sketched below)
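As a sketch, SMTs are declared as extra properties on a connector configuration; the transform aliases and field names below are hypothetical:

# Hypothetical SMT configuration added to a connector; field names are illustrative.
transforms=maskSsn,setKey
# Mask a sensitive field before the event is written to Kafka
transforms.maskSsn.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskSsn.fields=ssn
# Promote customer_id from the value into the record key (the partitioning key)
transforms.setKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.setKey.fields=customer_id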
34. KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent
• Enables stream processing with zero coding required
• The simplest way to process streams of data in real-time
• Powered by Kafka: scalable, distributed, battle-tested
• All you need is Kafka – no complex deployments of bespoke systems for stream processing
36. KSQL Example
● Streaming ETL
○ Kafka is popular for data pipelines.
○ KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
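For the query above to run, the underlying stream and table first have to be declared over their Kafka topics. A minimal sketch, with assumed topic names, formats, and columns:

CREATE STREAM clickstream (userid VARCHAR, page VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC='clickstream', VALUE_FORMAT='JSON');

CREATE TABLE users (user_id VARCHAR, level VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='user_id');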
37. KSQL Example
● Anomaly Detection
○ Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
38. KSQL Example
● Real Time Monitoring
○ Log data monitoring, tracking and alerting
○ Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
41. [Diagram: stream-table duality. The changelog stream ("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1) builds up the table state step by step, ending with alice = 2, charlie = 1, bob = 1.]
42. Streams & Tables
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream
• One record per key (per window)
• Current values (compacted topic) ← Not yet in KSQL
• Changelog
● STREAM – TABLE Joins
43. Aggregates and Windowing
• COUNT, SUM, MIN, MAX
• Windowing - Not strictly ANSI SQL ☺
• Three window types supported:
• TUMBLING
• HOPPING (aka ‘sliding’)
• SESSION
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW TUMBLING(size 5 minutes)
GROUP BY uid, name;
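The example above uses TUMBLING; the other two window types follow the same pattern. A sketch reusing the same query, with illustrative window sizes:

-- HOPPING: 30-minute windows that advance every 5 minutes (overlapping)
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW HOPPING (SIZE 30 MINUTES, ADVANCE BY 5 MINUTES)
GROUP BY uid, name;

-- SESSION: a window closes after 60 seconds of inactivity for a given key
SELECT uid, name, count(*) as rating_count
FROM vip_poor_ratings
WINDOW SESSION (60 SECONDS)
GROUP BY uid, name;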
44. Streaming ETL with Kafka Connect and KSQL
[Diagram: MySQL → Kafka Connect → Kafka cluster (topics: rental, rental_lengths, long_rentals) → Kafka Connect → Elasticsearch]
CREATE STREAM RENTAL_LENGTHS AS
SELECT END_DATE - START_DATE […]
FROM RENTAL
CREATE STREAM LONG_RENTALS AS
SELECT … FROM RENTAL_LENGTHS
WHERE DURATION > 14
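The final hop out of Kafka is again just connector configuration. A minimal sketch of an Elasticsearch sink for the LONG_RENTALS topic, assuming a local Elasticsearch endpoint:

# Hypothetical Elasticsearch sink connector configuration; the connector
# name and endpoint are illustrative.
name=es-sink-long-rentals
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=LONG_RENTALS
connection.url=http://localhost:9200
type.name=kafka-connect
key.ignore=true
schema.ignore=true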
54. More complex example: multiple transformations for different targets
[Diagram: a raw-logs source feeds KSQL (filter / aggregate / join), which derives error logs and SLA breaches streamed to Elasticsearch, HDFS / S3, and an alerting app]
55. Confluent and the Confluent Platform
• Confluent Enterprise: Monitoring, Multi-DC, Security + more
• Confluent Open Source: Connectors, Clients, KSQL + more
• Apache Kafka: Pub/Sub + Streams + Connect
56. Resources & Next Steps
Your turn!
• Download Confluent Platform
• Step through the QuickStart
• Play with the examples and demos
http://confluent.io/download
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1
https://slackpass.io/confluentcommunity