STREAMING SQL
Jungtaek Lim
WHO AM I?
• Software Engineer @ Hortonworks
• remote worker
• Open source prosumer
• PMC member of Apache Storm
• Committer of Jedis
• Contributor to Apache (Spark, Zeppelin, Ambari, Calcite), Redis, and so on
• Contact: kabhwan@gmail.com
WHAT DO END USERS WANT?
• Performance < Ease of use
• Technology innovation keeps making things easier and easier for end users
NOSQL ON HADOOP
MAPREDUCE SPARK SQL
NEXT FOR STREAMING?
SQL AGAIN!
STREAMING SQL
• Unbounded real-time data
• cannot be fully covered by the SQL standard and requires new ideas
• No standard yet
• Apache Calcite proposes its own streaming SQL
• https://calcite.apache.org/docs/stream.html
• aggregation, stream-to-relation joins, and stream-to-stream joins are done within a window
• most of it is not implemented yet
COMPARISON

| Engine       | Processing unit             | SQL-like API | SQL            | Streaming SQL                      | Status       |
|--------------|-----------------------------|--------------|----------------|------------------------------------|--------------|
| Apache Flink | Tuple                       | O            | O              | Not yet                            | Experimental |
| Apache Storm | Micro-batch (Trident based) | X            | O (Pure style) | Not yet                            | Experimental |
| Apache Spark | Micro-batch                 | O            | O              | Only with Structured Streaming API | Alpha        |
SIMPLE USE CASE
1. Get JSON from Kafka
2. Filter error logs (status >= 400)
3. Project columns with a user-defined function and calculations
4. Store rows back to Kafka
STORM SQL STATEMENTS

CREATE FUNCTION GET_TIME AS 'org.apache.storm.sql.runtime.functions.scalar.datetime.GetTime2'

CREATE EXTERNAL TABLE APACHE_LOGS (id INT PRIMARY KEY, remote_ip VARCHAR, request_url VARCHAR,
  request_method VARCHAR, status VARCHAR, request_header_user_agent VARCHAR,
  time_received_utc_isoformat VARCHAR, time_us DOUBLE)
LOCATION 'kafka://localhost:2181/brokers?topic=apachelogs'
TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'

CREATE EXTERNAL TABLE APACHE_ERROR_LOGS (id INT PRIMARY KEY, remote_ip VARCHAR, request_url VARCHAR,
  request_method VARCHAR, status INT, request_header_user_agent VARCHAR,
  time_received_utc_isoformat VARCHAR, time_received_timestamp BIGINT, time_elapsed_ms INT)
LOCATION 'kafka://localhost:2181/brokers?topic=apacheerrorlogs'
TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'

INSERT INTO APACHE_ERROR_LOGS
SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD,
  CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT, TIME_RECEIVED_UTC_ISOFORMAT,
  GET_TIME(TIME_RECEIVED_UTC_ISOFORMAT, 'yyyy-MM-dd''T''HH:mm:ssZZ') AS TIME_RECEIVED_TIMESTAMP,
  (TIME_US / 1000) AS TIME_ELAPSED_MS
FROM APACHE_LOGS
WHERE (CAST(STATUS AS INT) / 100) >= 4
(Diagram: Kafka input topic → Kafka output topic)
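The filter and projection steps can be illustrated outside Storm. Below is a minimal Python sketch of the same predicate and SELECT list, using made-up log rows (field names mirror APACHE_LOGS; the sample data is hypothetical, and this is not Storm SQL runtime code):

```python
# Sketch: the WHERE clause keeps rows whose HTTP status is 400 or above,
# using integer division exactly as in the SQL above.
def is_error_log(row):
    """Mimics WHERE (CAST(STATUS AS INT) / 100) >= 4."""
    return int(row["status"]) // 100 >= 4

def project(row):
    """Mimics part of the SELECT list: cast status, convert elapsed time to ms."""
    return {
        "id": row["id"],
        "status_int": int(row["status"]),
        "time_elapsed_ms": int(row["time_us"] / 1000),
    }

logs = [
    {"id": 1, "status": "200", "time_us": 1500.0},
    {"id": 2, "status": "404", "time_us": 800.0},
    {"id": 3, "status": "503", "time_us": 12000.0},
]
errors = [project(r) for r in logs if is_error_log(r)]
# errors → rows with ids 2 and 3 only
```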
CALCITE PROPOSAL
https://calcite.apache.org/docs/stream.html
PROPOSAL
WINDOWING
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId
HAVING COUNT(*) > 2 OR SUM(units) > 10;

| rowtime  | productId |
|----------|-----------|
| 10:00:00 | 30        |
| 11:00:00 | 10        |
| 11:00:00 | 40        |
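The tumbling-window semantics can be simulated on a finished stream. A hedged Python sketch (not Calcite code; the data is made up to match the Orders rows used in the join slides):

```python
from collections import defaultdict

# Each order is (rowtime, productId, units); the window key is the hour bucket,
# i.e. TUMBLE(rowtime, INTERVAL '1' HOUR).
orders = [
    ("10:17:00", 30, 4), ("10:17:05", 10, 1), ("10:18:05", 20, 2),
    ("10:18:07", 30, 20), ("11:02:00", 10, 6), ("11:04:00", 10, 1),
    ("11:09:30", 40, 12), ("11:24:11", 10, 4),
]

groups = defaultdict(lambda: [0, 0])  # (window_hour, productId) -> [count, sum(units)]
for rowtime, product_id, units in orders:
    window = int(rowtime.split(":")[0])
    groups[(window, product_id)][0] += 1
    groups[(window, product_id)][1] += units

# HAVING COUNT(*) > 2 OR SUM(units) > 10
result = sorted((w, p) for (w, p), (cnt, total) in groups.items()
                if cnt > 2 or total > 10)
# result → [(10, 30), (11, 10), (11, 40)], matching the three emitted rows
```

Note that a window's result can only be emitted once the window closes, which is why streaming aggregation needs event-time windows at all.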
PROPOSAL
STREAM-TO-RELATION JOIN
SELECT STREAM
  o.rowtime, o.productId, o.orderId, o.units,
  p.name, p.unitPrice
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;

| rowtime  | productId | orderId | units | name   | unitPrice |
|----------|-----------|---------|-------|--------|-----------|
| 10:17:00 | 30        | 5       | 4     | Cheese | 17        |
| 10:17:05 | 10        | 6       | 1     | Beer   | 0.25      |
| 10:18:05 | 20        | 7       | 2     | Wine   | 6         |
| 10:18:07 | 30        | 8       | 20    | Cheese | 17        |
| 11:02:00 | 10        | 9       | 6     | Beer   | 0.25      |
| 11:04:00 | 10        | 10      | 1     | Beer   | 0.25      |
| 11:09:30 | 40        | 11      | 12    | Bread  | 100       |
| 11:24:11 | 10        | 12      | 4     | Beer   | 0.25      |
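A stream-to-relation join amounts to a lookup into a static table for each arriving event. A minimal Python sketch of that enrichment, using the Products data from the table above (hypothetical in-memory representation, not how any of the engines implement it):

```python
# Static relation: productId -> (name, unitPrice)
products = {30: ("Cheese", 17), 10: ("Beer", 0.25),
            20: ("Wine", 6), 40: ("Bread", 100)}

# Stream events: (rowtime, productId, orderId, units)
orders = [("10:17:00", 30, 5, 4), ("10:17:05", 10, 6, 1)]

joined = []
for rowtime, product_id, order_id, units in orders:
    name, unit_price = products[product_id]  # ON o.productId = p.productId
    joined.append((rowtime, product_id, order_id, units, name, unit_price))
# Each input event produces an enriched output event immediately.
```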
PROPOSAL
STREAM-TO-RELATION JOIN (CONT.)
SELECT STREAM *
FROM Orders AS o
JOIN ProductVersions AS p
  ON o.productId = p.productId
  AND o.rowtime BETWEEN p.startDate AND p.endDate;

- ProductVersions is a temporal versioned table
- the unit price of product 10 is increased to 0.35 at 11:00

| rowtime  | productId | orderId | units | productId1 | name   | unitPrice |
|----------|-----------|---------|-------|------------|--------|-----------|
| 10:17:00 | 30        | 5       | 4     | 30         | Cheese | 17        |
| 10:17:05 | 10        | 6       | 1     | 10         | Beer   | 0.25      |
| 10:18:05 | 20        | 7       | 2     | 20         | Wine   | 6         |
| 10:18:07 | 30        | 8       | 20    | 30         | Cheese | 17        |
| 11:02:00 | 10        | 9       | 6     | 10         | Beer   | 0.35      |
| 11:04:00 | 10        | 10      | 1     | 10         | Beer   | 0.35      |
| 11:09:30 | 40        | 11      | 12    | 40         | Bread  | 100       |
| 11:24:11 | 10        | 12      | 4     | 10         | Beer   | 0.35      |
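The temporal join picks the table version whose validity interval contains the event's rowtime. A hedged Python sketch of that lookup, with hypothetical version intervals (times simplified to "HH:MM:SS" strings, which compare correctly lexicographically):

```python
# Temporal versioned table: (productId, startDate, endDate, name, unitPrice)
product_versions = [
    (10, "00:00:00", "10:59:59", "Beer", 0.25),
    (10, "11:00:00", "23:59:59", "Beer", 0.35),  # price raised at 11:00
    (30, "00:00:00", "23:59:59", "Cheese", 17),
]

def lookup(product_id, rowtime):
    """Mimics ON o.productId = p.productId
       AND o.rowtime BETWEEN p.startDate AND p.endDate."""
    for pid, start, end, name, price in product_versions:
        if pid == product_id and start <= rowtime <= end:
            return name, price
    return None

# An order for product 10 before 11:00 sees the old price,
# an order after 11:00 sees the new one.
```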
PROPOSAL
STREAM-TO-STREAM JOIN
SELECT STREAM
  o.rowtime, o.productId, o.orderId,
  s.rowtime AS shipTime
FROM Orders AS o
JOIN Shipments AS s
  ON o.orderId = s.orderId
  AND s.rowtime BETWEEN o.rowtime AND o.rowtime + INTERVAL '1' HOUR;

| rowtime  | productId | orderId | shipTime |
|----------|-----------|---------|----------|
| 10:17:00 | 30        | 5       | 10:55:00 |
| 10:17:05 | 10        | 6       | 10:20:00 |
| 11:02:00 | 10        | 9       | 11:58:00 |
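The time bound is what makes a stream-to-stream join feasible: it only matches pairs whose timestamps fall within the interval, so an engine can expire buffered state once the window has passed. A Python sketch of the matching logic on finished streams (hypothetical data mirroring the table above):

```python
# Streams: orders are (rowtime, productId, orderId); shipments are (rowtime, orderId).
orders = [("10:17:00", 30, 5), ("10:17:05", 10, 6), ("10:18:05", 20, 7),
          ("11:02:00", 10, 9)]
shipments = [("10:55:00", 5), ("10:20:00", 6), ("11:58:00", 9)]

def within_one_hour(order_time, ship_time):
    """s.rowtime BETWEEN o.rowtime AND o.rowtime + INTERVAL '1' HOUR."""
    o_h, o_m, o_s = map(int, order_time.split(":"))
    s_h, s_m, s_s = map(int, ship_time.split(":"))
    delta = (s_h - o_h) * 3600 + (s_m - o_m) * 60 + (s_s - o_s)
    return 0 <= delta <= 3600

joined = [(o_time, pid, oid, s_time)
          for (o_time, pid, oid) in orders
          for (s_time, s_oid) in shipments
          if oid == s_oid and within_one_hour(o_time, s_time)]
# Order 7 has no shipment within an hour, so it is never emitted.
```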
STILL NOT ENOUGH?
GUI
Drag and drop, configure, done!
img source: https://community.hortonworks.com/articles/8422/visualize-near-real-time-stock-price-changes-using.html
THANKS!
