Companies are recognizing the importance of a low-latency, scalable, fault-tolerant data backbone in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, enabling low-latency analytics, event-driven architectures, and the population of multiple downstream systems. These data pipelines can be built using configuration alone.
In this talk we'll see how easy it is to stream data from sources such as databases into Kafka using the Kafka Connect API. We'll use KSQL to filter and aggregate the data and join it to other data, and then stream this enriched data from Kafka out into targets such as Elasticsearch. All of this can be accomplished without a single line of code!
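As a concrete sketch of the configuration-only approach, a Kafka Connect JDBC source connector that streams a database table into Kafka might be configured like this (the connector name, connection URL, credentials, and table are illustrative placeholders, not taken from the talk):

```json
{
  "name": "jdbc-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo",
    "connection.user": "connect_user",
    "connection.password": "secret",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "mysql-"
  }
}
```

Submitting this JSON to the Kafka Connect REST API (`POST /connectors`) starts streaming the table into a Kafka topic – no code required.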
Streaming ETL to Elastic with Apache Kafka and KSQL
1. Streaming ETL to Elastic with Kafka and KSQL
San Francisco Elasticon, 1 March 2018
Nick Dearden
2. Apache Kafka®
A distributed commit log. Publish and subscribe to streams of records. Highly scalable, high throughput. Supports transactions. Persisted data.
Writes are append only; reads are a single seek & scan.
(Diagram: Kafka cluster)
3. Apache Kafka®
Kafka Streams API: write standard Java applications & microservices to process your data in real-time.
Kafka Connect API: reliable and scalable integration of Kafka with other systems – no coding required.
(Diagram: Orders and Customers topics feeding the Kafka Streams API)
18. KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent
• Enables stream processing with zero coding required
• The simplest way to process streams of data in real-time
• Powered by Kafka: scalable, distributed, battle-tested
• All you need is Kafka – no complex deployments of bespoke systems for stream processing
19. What is it for?
● Streaming ETL
○ Kafka is popular for data pipelines.
○ KSQL enables easy transformations of data within the pipeline
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
20. What is it for?
● Anomaly Detection
○ Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
21. What is it for?
● Real Time Monitoring
○ Log data monitoring, tracking and alerting
○ Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
25. Filtering streams with KSQL
ksql> CREATE STREAM ERROR_LOGS AS
SELECT * FROM LOGS
WHERE RESPONSE >= 400;
Message
----------------------------
Stream created and running
----------------------------
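Once created, the derived stream can itself be queried continuously; for example (the column list depends on the LOGS schema, so `SELECT *` is used here purely for illustration):

```sql
ksql> SELECT * FROM ERROR_LOGS;
```

In KSQL this is a continuous query: it keeps emitting matching records as new log events arrive, until interrupted.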
26. Streaming Transformations with KSQL
(Diagram: raw logs from an app server flow into Kafka; KSQL filters, aggregates and joins them into error logs and SLA breaches, which stream out to Elasticsearch, HDFS / S3 and an alert app)
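To complete the pipeline, the derived topics can be streamed from Kafka into Elasticsearch with the Kafka Connect Elasticsearch sink. A minimal illustrative configuration (connector name and connection URL are placeholders) might look like:

```json
{
  "name": "es-sink-error-logs",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "ERROR_LOGS",
    "connection.url": "http://localhost:9200",
    "type.name": "kafka-connect",
    "key.ignore": "true",
    "schema.ignore": "false"
  }
}
```

As with the source side, this is pure configuration: once submitted to the Connect REST API, every record on the ERROR_LOGS topic is indexed into Elasticsearch as it arrives.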
27. Monitoring thresholds with KSQL
ksql> CREATE TABLE SLA_BREACHES AS
SELECT RESPONSE, COUNT(*) AS REQUEST_COUNT
FROM LOGS
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE RESPONSE >= 400
GROUP BY RESPONSE
HAVING COUNT(*) > 10;
28. Streaming Transformations with KSQL
(Pipeline diagram repeated: raw logs filtered, aggregated and joined by KSQL into error logs and SLA breaches, streamed out to Elasticsearch, HDFS / S3 and an alert app)
29. Confluent Platform: Enterprise Streaming based on Apache Kafka®
Sources: database changes, log events, IoT data, web events, …
Destinations: CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, real-time applications, …
Apache Open Source – Apache Kafka®: Core | Connect API | Streams API
Confluent Open Source – SQL Stream Processing: KSQL; Data Compatibility: Schema Registry; Development and Connectivity: Clients | Connectors | REST Proxy | CLI
Confluent Enterprise – Monitoring & Administration: Confluent Control Center | Security; Operations: Replicator | Auto Data Balancing