"Streaming data is rapidly becoming a key component in modern applications, and Apache Kafka has emerged as a popular and powerful platform for managing and processing these data streams. However, as the volume and complexity of streaming data continue to grow, it becomes increasingly critical to have efficient and effective ways of querying and analyzing this data.
This is where query engines like Apache Flink, Trino, Timeplus, Materialize, and ksqlDB come in. These powerful tools offer flexible and scalable ways of processing and analyzing streaming data in real time, enabling users to extract valuable insights from their data streams.
In this talk, we will introduce the audience to the world of querying streaming data on Apache Kafka with SQL, compare and contrast the features and capabilities of each of these tools, and provide an in-depth analysis of their respective pros and cons. We will also discuss best practices and the scenarios where each tool is most effective.
In conclusion, query engines like Apache Flink, Trino, ksqlDB, Timeplus, and Materialize are useful tools for processing and analyzing streaming data on Kafka. With their ability to extract valuable insights from real-time data streams, these tools are a valuable asset for modern data-driven applications."
Query Your Streaming Data on Kafka using SQL: Why, How, and What
1. Query Kafka with SQL
Jove Zhong
Co-Founder and Head of Product, Timeplus
Gang Tao
Co-Founder and CTO, Timeplus
Why, how, and what’s next?
Sep 27, 2023
3. Real-time data is everywhere, at the edge and cloud
46 ZB of data created by billions of IoT devices by 2025
30% of data generated will be real-time by 2025
Only 1% of data is analyzed; streaming data remains primarily untapped
5. Why SQL on Database?
Berkeley DB (C):
ret = open_database(&(my_stock->inventory_dbp)..);
my_database->get(my_database, NULL, &key, &data, 0);

Aerospike (Python):
client.get(key)
update_bins = {'b': u"\ud83d\ude04", 'i': aerospike.null()}
client.put(key, update_bins)

DynamoDB (Java):
request = new GetItemRequest()
    .withKey(key_to_get)
    .withTableName(table_name);

SQL:
SELECT * FROM tab WHERE id='id1'
UPDATE tab SET flag=FALSE WHERE id='id1'
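To make the contrast concrete: the two SQL statements above run unchanged against any SQL engine, while each NoSQL API above is product-specific. A minimal sketch using Python's built-in sqlite3, with a hypothetical `tab` table and an invented row:

```python
import sqlite3

# Hypothetical table mirroring the slide's "tab"; the row is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id TEXT PRIMARY KEY, flag BOOLEAN)")
conn.execute("INSERT INTO tab VALUES ('id1', TRUE)")

# The same two statements from the slide, portable across SQL engines:
row = conn.execute("SELECT * FROM tab WHERE id='id1'").fetchone()
conn.execute("UPDATE tab SET flag=FALSE WHERE id='id1'")
flag = conn.execute("SELECT flag FROM tab WHERE id='id1'").fetchone()[0]
print(row, flag)
```

Swapping sqlite3 for another SQL-speaking system changes the connection line, not the queries.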
6. Why SQL on Kafka?
Reliable Fast Easy
Powerful Descriptive
7. FinTech
● Real-time post-trade analytics
● Real-time pricing
DevOps
● Real-time Github insights
● Real-time o11y and usage-based pricing
Security Compliance
● SOC2 compliance
● Container vulnerability monitoring
● Monitor Superblocks user activities
● Protect sensitive info in Slack
IoT
● Real-time fleet monitoring
Customer 360
● Auth0 notifications for new signups
● HubSpot custom dashboards/alerts
● Jitsu clickstream analytics
● Real-time Twitter marketing
Misc
● Wildfire monitoring and alerting
● Data-driven parent
Sample Use Cases
source: https://docs.timeplus.com/showcases
8. How do you like your coffee?
Flink ksqlDB Hazelcast
Druid Pinot
Trino
ClickHouse StarRocks
RisingWave Databend
Streaming Processor
Streaming Database
Real-time Database
12. Distributed computation and storage platform
No dependency on disk storage; it keeps all its operational state in the RAM of the cluster.
Flink ksqlDB Hazelcast
Druid Pinot
Trino
Streaming Processor
Streaming Database
Real-time Database
13. Apache Pinot:
1. Create a schema JSON (columns, PKs)
2. Create a table configuration JSON (streamType=Kafka)
3. docker run .. apachepinot/pinot:latest AddTable -schemaFile /tmp/transcript-schema.json -tableConfigFile /tmp/transcript-table-realtime.json .. -exec

Apache Druid:
1. Load the druid-kafka-indexing-service extension on both the Overlord and the MiddleManagers
2. Create a supervisor-spec.json containing the Kafka supervisor spec
3. curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor
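For reference, a minimal transcript-schema.json for the Pinot step might look like the following; the column names and types are illustrative assumptions, not the actual file from the demo:

```json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "timestampInEpoch", "dataType": "LONG",
     "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
```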
14. Add a catalog properties file etc/catalog/kafka.properties for the Kafka connector.
$ ./trino --catalog kafka --schema aSchema
trino:aSchema> SELECT count(*) FROM customer;
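The catalog file itself is only a few lines; a minimal sketch of etc/catalog/kafka.properties, where the broker address and table name are assumptions for illustration:

```properties
connector.name=kafka
kafka.nodes=localhost:9092
kafka.table-names=aSchema.customer
kafka.hide-internal-columns=false
```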
17. Stream tail:
SELECT * FROM car_live_data

Global aggregation:
SELECT count(*) FROM car_live_data

Window aggregation:
SELECT window_start, count(*)
FROM tumble(car_live_data, 1m)
GROUP BY window_start

Sub-streams:
SELECT cid, speed_kmh,
  lag(speed_kmh) OVER (PARTITION BY cid) AS last_spd
FROM car_live_data

Late event:
SELECT window_start, count(*)
FROM tumble(car_live_data, 5s)
GROUP BY window_start
EMIT AFTER WATERMARK AND DELAY 2s

Time travel:
SELECT * FROM car_live_data
WHERE _tp_time > now() - 1d
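The tumble() queries above assign each event to a fixed, non-overlapping time bucket and aggregate per bucket. A plain-Python sketch of that bucketing logic, with invented event data:

```python
from collections import Counter

def tumble(events, window_ms):
    """Count (timestamp_ms, payload) events per tumbling-window start."""
    counts = Counter()
    for ts, _payload in events:
        # Each event belongs to exactly one window: [start, start + window_ms)
        counts[ts - ts % window_ms] += 1
    return dict(counts)

# Invented events: three fall in the 0-5s window, one in the 5-10s window.
events = [(500, "car1"), (1200, "car2"), (4900, "car3"), (6100, "car4")]
print(tumble(events, 5000))  # {0: 3, 5000: 1}
```

A streaming engine does the same grouping incrementally and emits each window's result when the watermark passes, rather than after reading all input.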
26. Programming - turn data into insight
From machine-oriented to human-oriented:
1GL - machine language
2GL - assembly language
3GL - imperative language
4GL - descriptive language
5GL - intelligent language
27. Streaming Processor - ETL / Data Pipeline
● SQL as data pipeline
● No data storage
● Unbounded real-time query

Realtime Database - Historical Report / Ad hoc Analysis
● Mostly leveraging Kafka to ingest data
● Federated search/query (ClickHouse Kafka Engine, Trino)
● Bounded batch query, no streaming query

Streaming Database - Hybrid
● Supports Kafka data storage
● Unbounded real-time query
● Combination of real-time data and historical data
28. Query Kafka with SQL: Open Source + Cloud + Source Available
Flink, ksqlDB, Hazelcast, Druid, Pinot, Trino, ClickHouse, StarRocks, RisingWave, Databend
Streaming Processor / Streaming Database / Realtime Database
29. Community ☕☕☕☕ | Real-time ☕☕☕ | Streaming ☕☕☕ | Historical ☕ | JOIN ☕☕☕☕ | Largescale ☕☕☕☕ | Lightweight ☕☕ | Easy to use ☕☕

Community ☕☕☕ | Real-time ☕☕☕ | Streaming ☕☕☕ | Historical ☕☕ | JOIN ☕☕☕ | Largescale ☕☕ | Lightweight ☕☕ | Easy to use ☕☕☕

Community ☕☕ | Real-time ☕☕☕☕ | Streaming ☕☕☕ | Historical ☕☕☕☕ | JOIN ☕☕☕☕ | Largescale ☕☕ | Lightweight ☕☕☕☕ | Easy to use ☕☕☕

Community ☕☕☕ | Real-time ☕☕☕ | Streaming ☕☕☕ | Historical ☕☕ | JOIN ☕☕☕ | Largescale ☕☕☕ | Lightweight ☕☕☕☕ | Easy to use ☕☕☕
30. Q+A / Thank you!
Meet us at booth #407
Try Timeplus Proton (Open Source)
Or sign up for a free cloud account
timeplus.com