Mic Hussey, Senior Systems Engineer, Confluent
Using Kafka to integrate DWH and Cloud Based big data systems
https://www.meetup.com/Stockholm-Apache-Kafka-Meetup-by-Confluent/events/268636234/
1. CONFIDENTIAL
Using Kafka to integrate DWH and Cloud Based big data systems
Mic Hussey, Confluent Nordics, mic@confluent.io
2.
Apache Kafka, the de-facto OSS standard for event streaming
Real-time | Uses disk structure for constant performance at petabyte scale
Scalable | Distributed, scales quickly and easily without downtime
Persistent | Persists messages on disk, enables intra-cluster replication
Reliable | Replicates data, auto-balances consumers upon failure
In production at more than a third of the Fortune 500
2 trillion messages a day at LinkedIn
500 billion events a day (1.3 PB) at Netflix
4.
Data Warehouses to Big Data
5.
Kafka Integration Architecture
[Diagram: Apps, Search, NoSQL, DWH and Hado (label truncated in the source) connected through a central STREAMING PLATFORM, with PRODUCER and CONSUMER roles marked]
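A common argument for routing integrations through a central streaming platform, as in the diagram, is that each system connects once instead of once per counterpart. The sketch below only illustrates that arithmetic; the function names and the counts are assumptions, not figures from the talk.

```python
def point_to_point_links(producers: int, consumers: int) -> int:
    """Every producer integrates directly with every consumer."""
    return producers * consumers


def hub_links(producers: int, consumers: int) -> int:
    """Every system connects once to the central streaming platform."""
    return producers + consumers


# Illustrative counts only; the slide does not give numbers.
producers, consumers = 6, 4
print(point_to_point_links(producers, consumers))  # 24
print(hub_links(producers, consumers))             # 10
```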
6.
Sample use case: sales data
● Dataset from Kaggle https://www.kaggle.com/kyanyoga/sample-sales-data
7.
DWH
● Current de-facto data integration technology
● Third Normal Form
● Minimises data duplication
● Star schema
8.
Big Data
● Data storage is cheap
● Tabular data
● Flat schema
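The contrast between the two slides can be sketched as a denormalisation step: normalised tables hold each fact once and refer to each other by key, while the flat form repeats customer and product attributes on every row. The table layout, column names and values below are invented for illustration and are not taken from the Kaggle dataset.

```python
# Hypothetical 3NF-style tables, keyed by ID (all names and values are made up).
customers = {1: {"customer_name": "Example Toys Ltd", "country": "SE"}}
products = {"P-001": {"product_line": "Trains"}}
order_lines = [
    {"order_number": 1001, "customer_id": 1, "product_code": "P-001",
     "quantity": 30, "price_each": 9.5},
]


def flatten(line: dict, customers: dict, products: dict) -> dict:
    """Join one order line with its customer and product into a flat row."""
    # Drop the surrogate key; its attributes are copied into the row instead.
    row = {k: v for k, v in line.items() if k != "customer_id"}
    row.update(customers[line["customer_id"]])
    row.update(products[line["product_code"]])
    return row


flat_rows = [flatten(line, customers, products) for line in order_lines]
print(flat_rows[0])
```

The flat row trades duplication for simplicity: every consumer can read it without further lookups, which is acceptable when storage is cheap.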
11.
Kafka Cluster
Connect API | Stream Processing | Connect API
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
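The shell pipeline reads as an analogy: reading the input plays the role of a source connector, the grep and tr stages play the role of stream processing, and writing the output plays the role of a sink connector. A minimal sketch of the same three stages as Python generators follows; the function names are assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def source(lines: Iterable[str]) -> Iterator[str]:
    """Stands in for `cat < in.txt` (the source connector side)."""
    for line in lines:
        yield line.rstrip("\n")


def keep_matching(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Stands in for `grep "ksql"`: pass through lines containing the needle."""
    return (line for line in lines if needle in line)


def to_upper(lines: Iterable[str]) -> Iterator[str]:
    """Stands in for `tr a-z A-Z` (str.upper also handles non-ASCII letters)."""
    return (line.upper() for line in lines)


def sink(lines: Iterable[str]) -> List[str]:
    """Stands in for `> out.txt` (the sink connector side)."""
    return list(lines)


def run_pipeline(lines: Iterable[str]) -> List[str]:
    return sink(to_upper(keep_matching(source(lines), "ksql")))


print(run_pipeline(["ksql is sql\n", "plain kafka\n", "try ksql today\n"]))
# ['KSQL IS SQL', 'TRY KSQL TODAY']
```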
12.
KSQL is the streaming SQL engine for Apache Kafka
13.
CREATE STREAM oob_readings AS
SELECT *, c.std_value, c.sigma
FROM sensor_reading s
LEFT JOIN sensor_characteristics c
ON s.id = c.id
WHERE abs(s.value - c.std_value) > 3*c.sigma;
Simple SQL syntax for expressing reasoning along and across data streams.
You can write user-defined functions in Java.
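To make the logic of the query concrete, here is a rough Python equivalent of what it computes per reading: look up the sensor's characteristics by id, then keep the reading only if it lies more than three sigma from the expected value. The sample data is invented, and the treatment of unmatched sensors assumes standard SQL behaviour, where a comparison against NULL does not pass the WHERE clause.

```python
from typing import Optional

# Hypothetical reference data standing in for sensor_characteristics.
sensor_characteristics = {"s1": {"std_value": 20.0, "sigma": 0.5}}


def out_of_bounds(reading: dict, characteristics: dict) -> Optional[dict]:
    """Return the enriched reading if it is more than 3 sigma off, else None."""
    c = characteristics.get(reading["id"])
    if c is None:
        # The LEFT JOIN would produce NULLs here; the WHERE comparison then
        # fails (assuming standard SQL NULL semantics), so the row is dropped.
        return None
    if abs(reading["value"] - c["std_value"]) > 3 * c["sigma"]:
        return {**reading, "std_value": c["std_value"], "sigma": c["sigma"]}
    return None


readings = [
    {"id": "s1", "value": 20.4},  # within 3 sigma (0.4 <= 1.5)
    {"id": "s1", "value": 22.0},  # out of bounds (2.0 > 1.5)
    {"id": "s2", "value": 99.0},  # no characteristics known
]
oob = [r for r in (out_of_bounds(x, sensor_characteristics) for x in readings)
       if r is not None]
print(oob)
# [{'id': 's1', 'value': 22.0, 'std_value': 20.0, 'sigma': 0.5}]
```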
14.
Streaming KSQL: pairwise joins
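Assuming, as the slide title suggests, that each join statement combines two sources, a fact stream that references several master-data tables is enriched by chaining joins: the output of the first join becomes the input of the second. The sketch below shows that chaining in plain Python; the helper name, table names and values are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def join_pair(left_rows: Iterable[dict], right_table: Dict[str, dict],
              key: str) -> Iterator[dict]:
    """Inner-join each left row with the right-hand row found under row[key]."""
    for row in left_rows:
        match = right_table.get(row[key])
        if match is not None:
            yield {**row, **match}


# Hypothetical sample data.
orders = [{"order_id": 1, "customer_id": "c1", "product_id": "p1"}]
customers = {"c1": {"customer_name": "Example Toys Ltd"}}
products = {"p1": {"product_line": "Trains"}}

# Two pairwise joins, chained: orders + customers, then the result + products.
orders_with_customer = join_pair(orders, customers, "customer_id")
enriched = list(join_pair(orders_with_customer, products, "product_id"))
print(enriched)
```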
18.
Example code
● Template methods for invoking Kafka Connect REST APIs
● https://github.com/MichaelHussey/kafka_connect_curler
● 3rd Normal Form to Cloud
● https://github.com/MichaelHussey/dwh2cloud
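For orientation, creating a connector through the Kafka Connect REST API is a POST to /connectors with a JSON body containing a name and a config map. The sketch below builds such a request without sending it. It is not taken from either repository above; the connector name, host names, table and column names are assumptions, and the config keys are those of the Confluent JDBC source connector.

```python
import json
import urllib.request


def build_create_connector_request(base_url: str, name: str,
                                   config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /connectors request."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical worker address and connector settings.
request = build_create_connector_request(
    "http://localhost:8083",
    "dwh-customers-source",
    {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://dwh.example.com:5432/sales",
        "table.whitelist": "customers",
        "mode": "incrementing",
        "incrementing.column.name": "customer_id",
        "topic.prefix": "dwh-",
    },
)
print(request.get_method(), request.full_url)
# Passing the request to urllib.request.urlopen would submit it to a running worker.
```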
19.
Publish master data to a topic
20.
Flow through a compacted topic
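A compacted topic eventually retains only the most recent record per key, and a record with a null value (a tombstone) marks the key for deletion. That is what makes it a natural home for master data: a new consumer reading from the start ends up with the current state of every key. The sketch below models only the end result; real compaction runs in the background, so older records can remain visible for some time. The keys and values are invented.

```python
from typing import Dict, Iterable, Optional, Tuple


def compact(log: Iterable[Tuple[str, Optional[dict]]]) -> Dict[str, dict]:
    """Return the latest value per key; a None value (tombstone) removes the key."""
    latest: Dict[str, dict] = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest


# Hypothetical master-data updates, oldest first.
log = [
    ("c1", {"name": "Acme"}),
    ("c2", {"name": "Bolt"}),
    ("c1", {"name": "Acme AB"}),  # newer value for c1 replaces the first
    ("c2", None),                 # tombstone: c2 is deleted
]
print(compact(log))
# {'c1': {'name': 'Acme AB'}}
```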