This document introduces KSQL, a streaming SQL engine for Apache Kafka. It provides concise summaries of KSQL's capabilities and how to use it in 3 sentences or less:
KSQL allows users to easily query and transform data in Kafka streams using SQL-like queries. It provides simplicity, flexibility, and scalability compared to directly using Kafka Streams APIs. KSQL can be run in standalone, client-server, or application modes and is well-suited for tasks like streaming ETL, anomaly detection, monitoring, and IoT data processing.
6. 6
On the Shoulders of (Streaming) Giants
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Supports late-arriving and out-of-order data
• Windowing
• Millisecond processing latency, no micro-batching
• At-least-once and exactly-once processing guarantees
7. 7
Where can KSQL help ?
• Streaming ETL
• Kafka is popular for data pipelines.
• KSQL enables easy transformations of data within the pipe
• Anomaly Detection
• Identifying patterns or anomalies in real-time data, surfaced in milliseconds
• Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
• Simple derivations of existing topics
• One line to re-partition and/or re-key a topic for new uses
8. 8
Where is KSQL not such a great fit ?
• Ad-hoc query
• Limited span of time usually retained in Kafka
• No indexes
• BI reports (Tableau etc.)
• No indexes
• No JDBC (most BI tools are not good with continuous results!)
12. 12
Schema and Format
• A Kafka broker knows how to move <sets of bytes>
• Technically a message is ( ts, byte[], byte[] )
• SQL-like queries require a richer structure
• Start with message (value) format
• JSON
• DELIMITED (comma-separated in this preview)
• AVRO – requires that you supply a .avsc schema-file
• Pseudo-columns are automatically generated
• ROWKEY, ROWTIME
13. 13
CREATE STREAM clickstream (
time bigint,
url varchar,
status int,
bytes int,
userid varchar,
agent varchar)
WITH (
value_format = ‘JSON',
kafka_topic=‘my_clickstream_topic');
14. 14
SELECTing from the Stream
SELECT *
FROM clickstream
WHERE status >= 400
AND lcase(url) LIKE ‘%html'
CREATE STREAM error_stream AS
LIMIT 5;
15. 15
Aggregates and Windowing
• COUNT, SUM, MIN, MAX
• Windowing - Not strictly ANSI SQL
• Three window types supported:
• TUMBLING
• HOPPING (aka ‘sliding’)
• SESSION
SELECT url, ip, count(*) as problem_count
FROM clickstream
WINDOW TUMBLING(size 1 minute)
WHERE loglevel = ‘ERROR’
GROUP BY url, ip;
16. 16
Joins for Enrichment
CREATE STREAM vip_actions AS
SELECT userid, fullname, url, status
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
18. 18
KSQL Components
• CLI
• Designed to be familiar to users of MySQL, Postgres, etc
• Engine
• Actually runs the Kafka Streams topologies
• REST Server
• HTTP interface allows an Engine to receive instructions from the CLI
19. 19
How to run KSQL - #1 Stand-alone aka ‘local mode’
• Starts a CLI, an Engine, and a REST server all in the same JVM
• Ideal for laptop development
• Start with default settings:
> bin/ksql-cli local
• Or with customized settings:
> bin/ksql-cli local –-properties-file foo/bar/ksql.properties
20. 20
How to run KSQL - #2 Client-Server
• Start any number of Server nodes
• > bin/ksql-server-start
• Start any number of CLIs and specify ‘remote’ server address
• >bin/ksql-cli remote http://myserver:8090
• All running Engines share the processing load
• Technically, instances of the same Kafka Streams Applications
• Scale up/down without restart
21. 21
How to run KSQL - #3 as an Application
How do you deploy applications ?
22. 22
How to run KSQL - #3 as an Application
• Start any number of Engine instances
• Pass a file of KSQL statements to execute
> bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
• Version control your queries and transformations as code
• All running Engines share the processing load
• Technically, instances of the same Kafka Streams Applications
• Scale up/down without restart
23. 23
Dedicating resources
• Join Engines to the same ‘service pool’ by means of the ksql.service.id property
• In both Client-Server and Application modes
24. 24
Developer Preview
• Known Issues
• May leave orphaned topics from interactive or terminated queries
• Caveats
• No ‘ORDER BY’
• Stream-stream joins
• Limited function library (see syntax guide on GitHub)
• APIs / KSQL Syntax may change depending on feedback
25. 25
Resources & Next Steps
Time to get involved !
• Try the Quickstart on Github
• Check out the code
• Play with the examples
The point of ‘developer preview’ is that we can change things for the better, together
https://github.com/confluentinc/ksql
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql