4. 4Confidential
Streaming Data vs Big Data
Big Data was: The More the Better
Stream Data is: The Faster the Better
Stream Data can be: Big or Fast (Lambda)
Stream Data will be: Big AND Fast (Kappa)
Apache Kafka is the Enabling Technology of this Transition
[Figure: two charts, Value of Data against Volume of Data and Value of Data against Age of Data, with architecture diagrams labelled Streams, Hadoop, Speed Table, Batch Table, Database and Streams, Job 1, Job 2, Table 1, Table 2, Database.]
8.
The Streams API of Apache Kafka™
✓ No separate processing cluster required
✓ Develop on Mac, Linux, Windows
✓ Deploy to containers, VMs, bare metal, cloud
✓ Powered by Kafka: elastic, scalable, distributed, battle-tested
✓ Perfect for small, medium, large use cases
✓ Fully integrated with Kafka security
✓ Exactly-once processing semantics
✓ Part of Apache Kafka, included in Confluent Open Source
Write standard Java applications and microservices to process your data in real time
// Count page views per user session (a session closes after 5 minutes of inactivity)
KStream<User, PageViewEvent> pageViews = builder.stream("pageviews-topic");
KTable<Windowed<User>, Long> viewsPerUserSession = pageViews
    .groupByKey()
    .count(SessionWindows.with(TimeUnit.MINUTES.toMillis(5)), "session-views");
http://kafka.apache.org/documentation/streams
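To show what a session-window count computes without needing a Kafka cluster, here is a minimal plain-Java sketch. The class and method names are hypothetical and not part of the Kafka Streams API. It assumes the timestamps of a single key arrive sorted, and it ignores the out-of-order handling and session merging that the real library performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the Kafka Streams API.
class SessionCountSketch {

    // Returns the number of events in each session for ONE key.
    // Assumes timestamps (in ms) are sorted ascending. A new session starts
    // whenever the gap to the previous event exceeds gapMs.
    static List<Long> countPerSession(long[] timestampsMs, long gapMs) {
        List<Long> counts = new ArrayList<>();
        long current = 0;
        for (int i = 0; i < timestampsMs.length; i++) {
            if (i > 0 && timestampsMs[i] - timestampsMs[i - 1] > gapMs) {
                counts.add(current); // close the previous session
                current = 0;
            }
            current++;
        }
        if (current > 0) {
            counts.add(current); // close the last open session
        }
        return counts;
    }

    public static void main(String[] args) {
        long gapMs = 5 * 60 * 1000L; // 5 minutes, as in the slide
        long[] views = {0L, 60_000L, 120_000L, 600_000L, 660_000L};
        System.out.println(countPerSession(views, gapMs)); // prints [3, 2]
    }
}
```

The 8-minute pause between the third and fourth view exceeds the 5-minute gap, so the five views split into one session of three and one of two.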
14.
Why KSQL?
• Expand access to Kafka Stream Processing to more people
• More accessible
• Less intimidating
• Lower the barriers to entry
Benefits
• Enable stream processing with zero coding required
• The simplest way to process streams of data in real-time
• Powered by Kafka: scalable, distributed, battle-tested
15.
On the Shoulders of (Streaming) Giants
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Supports late-arriving and out-of-order data
• Windowing
• Millisecond processing latency, no micro-batching
• At-least-once and exactly-once processing guarantees
16.
KSQL Concepts
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream
• One record per key (per window)
• Current values (compacted topic)
• Changelog
● STREAM – TABLE Joins
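The STREAM and TABLE interpretations of the same topic content can be sketched in plain Java. This is a hypothetical illustration, not KSQL or Kafka code: the stream is the full list of records, and the table is what remains after replaying those records as a changelog and keeping the latest value per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the stream/table duality; not a KSQL API.
class StreamTableSketch {

    // Replays a changelog of {key, value} records into a table:
    // later records for a key overwrite earlier ones.
    static Map<String, String> toTable(List<String[]> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : changelog) {
            table.put(record[0], record[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> stream = new ArrayList<>();
        stream.add(new String[]{"alice", "Gold"});
        stream.add(new String[]{"bob", "Silver"});
        stream.add(new String[]{"alice", "Platinum"});

        System.out.println(stream.size());   // prints 3: every record is kept
        System.out.println(toTable(stream)); // prints {alice=Platinum, bob=Silver}
    }
}
```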
17.
Schema & Format
● Start with message (value) format
● JSON - the simplest choice
● DELIMITED - in this preview, the delimiter is implicitly a comma and the escaping rules are built in; this will be expanded.
● AVRO - requires that you also supply a schema file (.avsc); Schema Registry support is coming soon.
● Pseudo-columns are automatically provided
• ROWKEY, ROWTIME - for querying the message key and timestamp
• (PARTITION, OFFSET coming soon)
CREATE STREAM pageview (viewtime bigint, userid varchar, pageid varchar)
  WITH (value_format = 'delimited', kafka_topic = 'my_pageview_topic');
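As a rough picture of how a DELIMITED value maps onto the declared columns plus the ROWKEY and ROWTIME pseudo-columns, consider the plain-Java sketch below. It is an assumption-laden illustration, not KSQL's parser: the class name is hypothetical, the sample record is made up, and escaping rules are ignored entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; KSQL's real DELIMITED parsing also handles escaping.
class DelimitedRowSketch {

    // Column order taken from the CREATE STREAM pageview statement.
    static final String[] COLUMNS = {"viewtime", "userid", "pageid"};

    // Builds a row from a message key, a message timestamp, and a comma-delimited value.
    static Map<String, Object> toRow(String key, long timestampMs, String value) {
        String[] fields = value.split(",", -1);
        if (fields.length != COLUMNS.length) {
            throw new IllegalArgumentException("expected " + COLUMNS.length + " fields");
        }
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("ROWTIME", timestampMs);                       // pseudo-column: message timestamp
        row.put("ROWKEY", key);                                // pseudo-column: message key
        row.put(COLUMNS[0], Long.parseLong(fields[0].trim())); // viewtime bigint
        row.put(COLUMNS[1], fields[1].trim());                 // userid varchar
        row.put(COLUMNS[2], fields[2].trim());                 // pageid varchar
        return row;
    }

    public static void main(String[] args) {
        // Made-up sample record for illustration.
        System.out.println(toRow("user_1", 1500000000000L, "1500000000000,user_1,page_42"));
    }
}
```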
18.
Interactive Querying
● Great for iterative development
● LIST (or SHOW) STREAMS / TABLES
● DESCRIBE STREAM / TABLE
● SELECT
• Selects rows from a KSQL stream or table.
• The result of this statement will be printed out in the console.
• To stop the continuous query in the CLI press Ctrl+C.
19.
What is it for?
● Streaming ETL
○ Kafka is popular for data pipelines.
○ KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u ON c.userid = u.user_id
  WHERE u.level = 'Platinum';
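The logic of this stream-table join can be sketched in plain Java: each click event looks up the user's current level in a table, and only Platinum users pass the filter. The class, method, and sample data are hypothetical; a real KSQL query runs continuously over Kafka topics rather than over in-memory collections.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the vip_actions query; not KSQL code.
class VipActionsSketch {

    // clicks: records of {userid, page, action}; userLevels: user_id -> level.
    static List<String[]> vipActions(List<String[]> clicks, Map<String, String> userLevels) {
        List<String[]> out = new ArrayList<>();
        for (String[] click : clicks) {
            // LEFT JOIN: the lookup yields null when the user is not in the table.
            String level = userLevels.get(click[0]);
            // WHERE u.level = 'Platinum': unmatched (null) rows are dropped as well.
            if ("Platinum".equals(level)) {
                out.add(click);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "Platinum");
        users.put("u2", "Gold");

        List<String[]> clicks = new ArrayList<>();
        clicks.add(new String[]{"u1", "/home", "view"});
        clicks.add(new String[]{"u2", "/cart", "add"});
        clicks.add(new String[]{"u3", "/home", "view"}); // unknown user

        for (String[] row : vipActions(clicks, users)) {
            System.out.println(String.join(",", row)); // prints u1,/home,view
        }
    }
}
```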
21.
What is it for?
● Anomaly Detection
○ Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
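The tumbling-window count behind this query can be illustrated in plain Java. This is a hypothetical batch sketch over made-up data, not how KSQL executes the query: each attempt is assigned to the fixed 5-second window containing its timestamp, attempts are counted per card and window, and only counts above the threshold are kept. It assumes non-negative timestamps.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a tumbling-window count with a HAVING filter.
class FraudWindowSketch {

    // Returns "card@windowStartMs" -> count for every window whose count exceeds threshold.
    static Map<String, Long> possibleFraud(String[] cards, long[] timestampsMs,
                                           long windowMs, long threshold) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < cards.length; i++) {
            // Tumbling windows do not overlap: each timestamp falls into exactly one.
            long windowStart = timestampsMs[i] - (timestampsMs[i] % windowMs);
            counts.merge(cards[i] + "@" + windowStart, 1L, Long::sum);
        }
        counts.values().removeIf(c -> c <= threshold); // HAVING count(*) > threshold
        return counts;
    }

    public static void main(String[] args) {
        String[] cards = {"1111", "1111", "1111", "1111", "2222", "1111"};
        long[] times = {0L, 1000L, 2000L, 4999L, 3000L, 5000L};
        System.out.println(possibleFraud(cards, times, 5000L, 3L)); // prints {1111@0=4}
    }
}
```

Card 1111 has four attempts in the window starting at 0 ms, so it is flagged; its fifth attempt at 5000 ms falls into the next window and does not count toward the first.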
22.
What is it for?
● Real Time Monitoring
○ Log data monitoring, tracking and alerting
○ Sensor / IoT data
CREATE TABLE error_counts AS
  SELECT error_code, count(*)
  FROM monitoring_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE type = 'ERROR'
  GROUP BY error_code;
23.
What is it for?
● Simple Derivations of Existing Topics
○ One-liner to re-partition and/or re-key a topic for new uses
CREATE STREAM views_by_userid
  WITH (PARTITIONS=6,
        VALUE_FORMAT='JSON',
        TIMESTAMP='view_time') AS
  SELECT *
  FROM clickstream
  PARTITION BY user_id;
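What re-keying buys you can be sketched in plain Java: once user_id is the message key, every record for a given user lands in the same partition. This is a hypothetical illustration. In particular, `String.hashCode` stands in for Kafka's actual default partitioner, which hashes the serialized key bytes with murmur2, so the partition numbers here will not match a real cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of key-based partitioning; not Kafka's partitioner.
class RekeySketch {

    // Assumption: String.hashCode replaces Kafka's murmur2 hash of the key bytes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups keys by the partition they would be written to.
    static Map<Integer, List<String>> repartition(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition.computeIfAbsent(partitionFor(key, numPartitions),
                                        p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> userIds = new ArrayList<>();
        userIds.add("user_1");
        userIds.add("user_2");
        userIds.add("user_1"); // same key, so same partition as the first record
        System.out.println(repartition(userIds, 6)); // 6 partitions, as in the slide
    }
}
```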
24.
KSQL Components
• CLI
• Designed to be familiar to users of MySQL, Postgres, etc
• Engine
• Actually runs the Kafka Streams topologies
• REST Server
• HTTP interface allows an Engine to receive instructions from the CLI
25.
How to run KSQL - #1 Stand-alone aka ‘local mode’
• Starts a CLI, an Engine, and a REST server all in the same JVM
• Ideal for laptop development
• Start with default settings:
> bin/ksql-cli local
• Or with customized settings:
> bin/ksql-cli local --properties-file foo/bar/ksql.properties
26.
How to run KSQL - #2 Client-Server
• Start any number of Server nodes
> bin/ksql-server-start
• Start any number of CLIs and specify the ‘remote’ server address
> bin/ksql-cli remote http://myserver:8090
• All running Engines share the processing load
• Technically, instances of the same Kafka Streams application
• Scale up/down without restart
28.
How to run KSQL - #3 as an Application
• Start any number of Engine instances
• Pass a file of KSQL statements to execute
> bin/ksql-node foo/bar.sql
• Ideal for streaming ETL application deployment
• Version control your queries and transformations as code
• All running Engines share the processing load
• Technically, instances of the same Kafka Streams application
• Scale up/down without restart
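One way to picture how Engines share the load is as a division of topic partitions among the running instances, recomputed whenever an instance joins or leaves. The plain-Java sketch below uses simple round-robin purely as an assumption for illustration. The class is hypothetical, and Kafka Streams' real task assignment is more involved than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; real Kafka Streams task assignment is more sophisticated.
class LoadSharingSketch {

    // Returns, for each instance, the partitions it would own under round-robin.
    static List<List<Integer>> assign(int numPartitions, int numInstances) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            owned.get(p % numInstances).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        System.out.println(assign(6, 2)); // prints [[0, 2, 4], [1, 3, 5]]
        // Scaling up: adding a third instance spreads the same partitions more thinly.
        System.out.println(assign(6, 3)); // prints [[0, 3], [1, 4], [2, 5]]
    }
}
```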