5. Two definitions of Stream SQL
1. Run a continuous SQL query that reads an infinite stream and continuously produces results
2. Continuously ingest streams into a warehouse. Query the real-time data in the warehouse.
Definition 1 is Flink's Stream SQL.
Definition 2 is a good use case for Kafka + Flink + Druid.
7. An Example
val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(execEnv)

// define a JSON encoded Kafka topic as external table
// (kafkaProps: Kafka connection properties, assumed to be defined elsewhere)
val sensorSource = new KafkaJsonSource[(String, Long, Double)](
  "sensorTopic", kafkaProps, ("location", "time", "tempF"))

// register external table
tableEnv.registerTableSource("sensorData", sensorSource)

// define query on external table
// (0.556 approximates 5/9 in the Fahrenheit-to-Celsius conversion)
val roomSensors: Table = tableEnv.sql("""
  SELECT STREAM time, location AS room, (tempF - 32) * 0.556 AS tempC
  FROM sensorData
  WHERE location LIKE 'room%' """)

// write the table back to Kafka as JSON
roomSensors.toSink(new KafkaJsonSink(...))
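To make the semantics of the continuous query concrete, here is a minimal sketch in plain Scala (no Flink dependency) of what the query does to each incoming record: filter on the location prefix, then project and convert. The case classes `SensorReading` and `RoomReading` and the object `RoomSensorsSketch` are hypothetical names introduced for illustration only; they are not part of Flink.

```scala
// Hypothetical record types mirroring the ("location", "time", "tempF") schema.
final case class SensorReading(location: String, time: Long, tempF: Double)
final case class RoomReading(time: Long, room: String, tempC: Double)

object RoomSensorsSketch {
  // Per-record equivalent of the WHERE clause and the SELECT projection.
  def toRoomReading(r: SensorReading): Option[RoomReading] =
    if (r.location.startsWith("room")) // WHERE location LIKE 'room%'
      Some(RoomReading(r.time, r.location, (r.tempF - 32) * 0.556))
    else
      None

  def main(args: Array[String]): Unit = {
    // An Iterator stands in for the unbounded stream: results are
    // produced record by record, never as one final result set.
    val readings = Iterator(
      SensorReading("room1", 1L, 212.0),
      SensorReading("hall", 2L, 70.0))
    readings.flatMap(r => toRoomReading(r)).foreach(println)
  }
}
```

The point of the sketch is that a query of this shape is stateless: every result row depends on exactly one input row, so it can run continuously over an infinite stream.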
10. Sharing State with Applications
Access to the stream aggregates with a latency bound
Write them to a key/value store
The key/value store is often the biggest bottleneck
13. What does it bring?
Fewer moving parts in the infrastructure
Performance!
From an extension of Yahoo!'s streaming benchmark:
• With key/value store: 280,000 events/s
• Queryable state: 15,000,000 events/s
What's the secret?
• No synchronous distributed communication
• Persistence via Flink's checkpoints (asynchronous snapshots)
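The two points above can be illustrated with a toy model in plain Scala. This is only a sketch of the idea, not Flink's queryable state API: the object `QueryableStateSketch` and its methods are hypothetical. Updates and queries touch only operator-local memory, and a snapshot is just a reference to an immutable map that a background thread could persist while processing continues.

```scala
object QueryableStateSketch {
  // Operator-local state: an immutable map, replaced on every update.
  @volatile private var counts: Map[String, Long] = Map.empty

  // Hot path: a purely local update, no round trip to an external store.
  def processEvent(key: String): Unit = synchronized {
    counts = counts.updated(key, counts.getOrElse(key, 0L) + 1L)
  }

  // Queries read the current local state directly.
  def query(key: String): Option[Long] = counts.get(key)

  // Taking a snapshot is a cheap reference copy. Because the map is
  // immutable, it can be written to durable storage asynchronously
  // while processEvent keeps running.
  def snapshot(): Map[String, Long] = counts
}
```

In this model, later events never change a snapshot that has already been taken, which is what allows persistence to happen off the processing path.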
15. Adjust parallelism of Streaming Programs
Figure: the same program in three configurations: initial configuration, scaled out (for load), scaled in (to save resources).
16. Adjust parallelism of Streaming Programs
Adjusting parallelism without (significantly) interrupting the program
Initial version:
• Savepoint -> stop -> restart-with-different-parallelism
Stateless operators: Trivial
Stateful operators: Repartition state
• State reorganized by key for key/value state and windows
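As a rough illustration of "state reorganized by key", the sketch below redistributes keyed state over a new number of partitions in plain Scala. It is not Flink's implementation: the object `RescaleSketch`, the `hashCode`-modulo partitioner, and the representation of state as one `Map` per partition are all assumptions made for this example.

```scala
object RescaleSketch {
  // Keyed state held by one parallel partition (hypothetical representation).
  type Partition = Map[String, Long]

  // Assumed partitioner: hash of the key modulo the parallelism.
  def partitionFor(key: String, parallelism: Int): Int =
    java.lang.Math.floorMod(key.hashCode, parallelism)

  // Restore step: take the state from the savepoint (one map per old
  // partition) and regroup every entry under its new owner.
  def repartition(old: Seq[Partition], newParallelism: Int): Vector[Partition] = {
    val entries: Seq[(String, Long)] = old.flatMap(_.toSeq)
    val byTarget = entries.groupBy { case (key, _) => partitionFor(key, newParallelism) }
    Vector.tabulate(newParallelism) { i =>
      byTarget.getOrElse(i, Seq.empty[(String, Long)]).toMap
    }
  }
}
```

Note that this naive scheme has to inspect every single key to find its new partition, which is the cost that motivates a coarser unit of redistribution.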
19. Redistribution via Key Groups
Flink 1.0: Hash keys into parallel partitions.
Finest granularity is a partition.
Flink 1.1: Hash keys into KeyGroups.
Assign KeyGroups to parallel partitions
A change of parallelism means a change in the assignment of KeyGroups to parallel partitions
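The two-level mapping can be sketched as follows. This is a simplified illustration, not Flink's code: the object `KeyGroupSketch`, the use of plain `hashCode`, and the assignment of contiguous ranges of KeyGroups to partitions are assumptions. The number of KeyGroups is fixed up front; only the second mapping depends on the parallelism.

```scala
object KeyGroupSketch {
  // Level 1: key -> KeyGroup. Independent of the current parallelism.
  def keyGroupFor(key: Any, numKeyGroups: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numKeyGroups)

  // Level 2: KeyGroup -> parallel partition. Assumed scheme: each
  // partition owns a contiguous range of KeyGroups.
  def partitionFor(keyGroup: Int, numKeyGroups: Int, parallelism: Int): Int =
    keyGroup * parallelism / numKeyGroups

  // All KeyGroups owned by one partition at a given parallelism.
  def keyGroupsOf(partition: Int, numKeyGroups: Int, parallelism: Int): Seq[Int] =
    (0 until numKeyGroups).filter { kg =>
      partitionFor(kg, numKeyGroups, parallelism) == partition
    }
}
```

With 8 KeyGroups, going from parallelism 2 to 4 only changes which partition owns each group: a key's KeyGroup stays the same, so state can be moved one whole KeyGroup at a time instead of key by key.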
20. Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org