1. Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL
SPI-NL 2018
11 Oct 2018 / Mic Hussey
@hussey_mic mic@confluent.io
2. @hussey_mic / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - SPI-NL 2018 2
$ whoami
• Systems Engineer @ Confluent
• Working in messaging/event processing since 1998
• GitHub: https://github.com/MichaelHussey
• Twitter: @hussey_mic
3. A bit of a mess…
[Diagram: apps, search, Hadoop, DWH, monitoring, security, MQs, and caches]
4. The Streaming Platform
[Diagram: Kafka connecting apps, the DWH, and Hadoop; labels: request-response, messaging OR stream processing, streaming data pipelines, changelogs]
5. Database offload → Analytics
[Diagram: RDBMS, captured via CDC, offloaded to HDFS / S3 / BigQuery etc.]
6. Streaming ETL with Apache Kafka and KSQL
[Diagram: order events and customer data (from an RDBMS via CDC) feed stream processing, which produces customer orders]
7. Real-time Event Stream Enrichment
[Diagram: order events and customer data (from an RDBMS via CDC) feed stream processing, which produces customer orders; additional label: <y>]
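The enrichment pattern on this slide can be sketched in plain Python, without Kafka. Change events captured from the database keep an in-memory customer table current, and each order event is joined against whatever that table holds when the order arrives. The event shapes (`kind`, `op`, `id`, `row`, `customer_id`), the function names, and the sample values are hypothetical, chosen for illustration; they are not the format of any particular CDC tool.

```python
from typing import Dict, Iterable, Iterator, Optional


def apply_change(table: Dict[int, dict], change: dict) -> None:
    """Apply one CDC-style change event to the in-memory customer table."""
    if change["op"] == "delete":
        table.pop(change["id"], None)
    else:  # "insert" and "update" both upsert the latest row image
        table[change["id"]] = change["row"]


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Consume interleaved change and order events in arrival order.

    Each order is joined against the table state at the moment it arrives.
    Orders for unknown customers are emitted with customer=None.
    """
    customers: Dict[int, dict] = {}
    for event in events:
        if event["kind"] == "customer_change":
            apply_change(customers, event)
        elif event["kind"] == "order":
            customer: Optional[dict] = customers.get(event["customer_id"])
            yield {**event, "customer": customer}


# Made-up sample data: customer 3 is upgraded between two orders.
events = [
    {"kind": "customer_change", "op": "insert", "id": 3,
     "row": {"name": "Merilyn", "level": "gold"}},
    {"kind": "order", "order_id": 100, "customer_id": 3},
    {"kind": "customer_change", "op": "update", "id": 3,
     "row": {"name": "Merilyn", "level": "platinum"}},
    {"kind": "order", "order_id": 101, "customer_id": 3},
    {"kind": "order", "order_id": 102, "customer_id": 9},
]

enriched = list(enrich(events))
for row in enriched:
    print(row["order_id"], row["customer"])
```

The two orders for the same customer carry different customer details because the database row changed in between; that is the reason for feeding the table from CDC rather than from a one-off export.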
8. Transform Once, Use Many
[Diagram: order events and customer data (from an RDBMS via CDC) feed stream processing, which produces customer orders; additional labels: <y>, New App, <x>]
9. Transform Once, Use Many
[Diagram: order events and customer data (from an RDBMS via CDC) feed stream processing, which produces customer orders; additional labels: <y>, HDFS / S3 / etc, New App, <x>]
10. KSQL
Streaming ETL with Apache Kafka
11. Streaming Integration with Kafka Connect
[Diagram: Kafka Connect (workers and tasks) between sources, sinks, and the Kafka brokers; examples shown: Amazon S3, syslog, flat file, CSV, JSON]
12. The Connect API of Apache Kafka®
Reliable and scalable integration of Kafka with other systems – no coding required.
✓ Centralized management and configuration
✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
✓ Supports CDC ingest of events from RDBMS
✓ Preserves data schema
✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in Confluent Open Source
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
  "table.whitelist": "sales,orders,customers"
}
https://docs.confluent.io/current/connect/
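A connector configuration like the one on this slide is typically submitted to a Kafka Connect worker over its REST interface, as a JSON body with a `name` and a `config` map. The sketch below only builds that body and sends nothing over the network. The connector name `jdbc-source-demo` and the helper `connector_payload` are made up for illustration, and a real JDBC source connector will likely need further settings beyond the three shown on the slide.

```python
import json


def connector_payload(name: str, config: dict) -> str:
    """Build a JSON body of the form {"name": ..., "config": {...}}."""
    if "connector.class" not in config:
        raise ValueError("config must name a connector.class")
    # Connector configs are string-to-string maps, so stringify defensively.
    string_config = {key: str(value) for key, value in config.items()}
    return json.dumps({"name": name, "config": string_config}, indent=2)


# Configuration copied from the slide; the connector name is hypothetical.
jdbc_source = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers",
}

payload = connector_payload("jdbc-source-demo", jdbc_source)
print(payload)
```

The resulting body could then be sent with any HTTP client to the worker's `/connectors` endpoint, for example at `http://localhost:8083/connectors` if the worker listens on the default REST port.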
13. Integrating Postgres with Kafka
[Diagram: Postgres and Kafka, integrated via Kafka Connect]
14. Confluent Hub
hub.confluent.io
• Launched June 2018
• One-stop place to discover and download:
  • Connectors
  • Transformations
  • Converters
15. KSQL
Streaming ETL with Apache Kafka
16. KSQL is a Declarative Stream Processing Language
17. KSQL is the Streaming SQL Engine for Apache Kafka
18. KSQL for Streaming ETL
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u
ON c.userid = u.user_id
WHERE u.level = 'Platinum';
Joining, filtering, and aggregating streams of event data
19. KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
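To make the windowing semantics of the query above concrete, here is a minimal batch sketch in Python. It assigns each authorization attempt to a fixed five-second window, counts attempts per window and card, and keeps only counts above three. The input format (epoch-millisecond timestamp plus card number), the function name, and the sample data are assumptions for illustration. KSQL itself evaluates the query continuously as events arrive rather than over a finished list.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_MS = 5_000  # mirrors WINDOW TUMBLING (SIZE 5 SECONDS)
THRESHOLD = 3      # mirrors HAVING count(*) > 3


def possible_fraud(
    attempts: Iterable[Tuple[int, str]]
) -> Dict[Tuple[int, str], int]:
    """Count attempts per (window start, card) and keep counts > THRESHOLD.

    Each attempt is (timestamp in epoch milliseconds, card number).
    """
    counts: Counter = Counter()
    for timestamp_ms, card_number in attempts:
        # Tumbling windows are fixed-size and non-overlapping, so each event
        # belongs to exactly one window, identified here by its start time.
        window_start = timestamp_ms - (timestamp_ms % WINDOW_MS)
        counts[(window_start, card_number)] += 1
    return {key: n for key, n in counts.items() if n > THRESHOLD}


# Made-up sample data.
attempts = [
    (1_000, "card-A"), (1_500, "card-A"), (2_000, "card-A"), (4_900, "card-A"),
    (5_100, "card-A"),                     # falls into the next window
    (1_200, "card-B"), (3_300, "card-B"),  # only two attempts
]

flagged = possible_fraud(attempts)
print(flagged)
```

Only `card-A` in the window starting at 0 ms is flagged: it has four attempts there, while its fifth attempt lands in the following window and `card-B` never exceeds the threshold.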
20. KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• syslog data
• Sensor / IoT data
CREATE STREAM SYSLOG_INVALID_USERS AS
SELECT HOST, MESSAGE
FROM SYSLOG
WHERE MESSAGE LIKE '%Invalid user%';
http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
21. KSQL
Streaming ETL with Apache Kafka
23. Filter all ratings where STARS < 3
Input (Producer API):
{
  "rating_id": 5313,
  "user_id": 3,
  "stars": 4,
  "route_id": 6975,
  "rating_time": 1519304105213,
  "channel": "web",
  "message": "worst. flight. ever. #neveragain"
}
Output stream: POOR_RATINGS
CREATE STREAM POOR_RATINGS AS
SELECT * FROM ratings WHERE STARS < 3;
24. Filter for just PLATINUM customers
Rating event (Producer API):
{
  "rating_id": 5313,
  "user_id": 3,
  "stars": 4,
  "route_id": 6975,
  "rating_time": 1519304105213,
  "channel": "web",
  "message": "worst. flight. ever. #neveragain"
}
Customer record (Kafka Connect):
{
  "id": 3,
  "first_name": "Merilyn",
  "last_name": "Doughartie",
  "email": "mdoughartie1@dedecms.com",
  "gender": "Female",
  "club_status": "platinum",
  "comments": "none"
}
RATINGS_WITH_CUSTOMER_DATA: join each rating to customer data
UNHAPPY_PLATINUM_CUSTOMERS: filter for just PLATINUM customers
CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS
SELECT * FROM RATINGS_WITH_CUSTOMER_DATA
WHERE STARS < 3
AND CLUB_STATUS = 'platinum';
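The two steps on this slide, joining each rating to its customer record and then filtering the joined stream, can be sketched in Python as two chained generators. This is an illustration of the logic only, not how KSQL executes it. The first rating and the first customer are abridged from the slide; the other records are made up. The join here drops ratings with no matching customer, which is an assumption, since the slide does not show the join statement.

```python
from typing import Dict, Iterable, Iterator

customers: Dict[int, dict] = {
    3: {"id": 3, "first_name": "Merilyn", "last_name": "Doughartie",
        "club_status": "platinum"},
    7: {"id": 7, "first_name": "Sam", "last_name": "Example",
        "club_status": "bronze"},  # made-up customer
}

ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4,
     "message": "worst. flight. ever. #neveragain"},
    {"rating_id": 5314, "user_id": 3, "stars": 1, "message": "lost my bag"},  # made up
    {"rating_id": 5315, "user_id": 7, "stars": 2, "message": "cold coffee"},  # made up
]


def ratings_with_customer_data(
    stream: Iterable[dict], table: Dict[int, dict]
) -> Iterator[dict]:
    """Join each rating to the customer whose id equals the rating's user_id."""
    for rating in stream:
        customer = table.get(rating["user_id"])
        if customer is not None:  # assumption: unmatched ratings are dropped
            yield {**rating, **customer}


def unhappy_platinum_customers(joined: Iterable[dict]) -> Iterator[dict]:
    """Keep joined rows with fewer than 3 stars from platinum customers."""
    for row in joined:
        if row["stars"] < 3 and row["club_status"] == "platinum":
            yield row


result = list(unhappy_platinum_customers(ratings_with_customer_data(ratings, customers)))
for row in result:
    print(row["rating_id"], row["first_name"], row["stars"])
```

The sample rating shown on the slide has `stars` set to 4, so it passes the join but is dropped by the filter despite its message; only the made-up one-star rating from the platinum customer survives both steps.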
25. Aggregate per-minute by CLUB_STATUS
Rating event (Producer API):
{
  "rating_id": 5313,
  "user_id": 3,
  "stars": 4,
  "route_id": 6975,
  "rating_time": 1519304105213,
  "channel": "web",
  "message": "worst. flight. ever. #neveragain"
}
Customer record (Kafka Connect):
{
  "id": 3,
  "first_name": "Merilyn",
  "last_name": "Doughartie",
  "email": "mdoughartie1@dedecms.com",
  "gender": "Female",
  "club_status": "platinum",
  "comments": "none"
}
RATINGS_WITH_CUSTOMER_DATA: join each rating to customer data
RATINGS_BY_CLUB_STATUS_1: aggregate per-minute by CLUB_STATUS
CREATE TABLE RATINGS_BY_CLUB_STATUS AS
SELECT CLUB_STATUS, COUNT(*)
FROM RATINGS_WITH_CUSTOMER_DATA
WINDOW TUMBLING (SIZE 1 MINUTES)
GROUP BY CLUB_STATUS;
26. Confluent Open Source: Apache Kafka with a bunch of cool stuff! For free!
[Diagram: Confluent Platform. Data shown: database changes, log events, IoT data, web events. Systems shown: CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, real-time applications.]
Confluent Platform components (the figure marks each as Apache Open Source, Confluent Open Source, or Confluent Enterprise):
• Apache Kafka®: Core | Connect API | Streams API
• Data Compatibility: Schema Registry
• Monitoring & Administration: Confluent Control Center | Security
• Operations: Replicator | Auto Data Balancing
• Development and Connectivity: Clients | Connectors | REST Proxy | CLI
• SQL Stream Processing: KSQL
27. Free Books!
https://www.confluent.io/apache-kafka-stream-processing-book-bundle
29. Useful links
• Postgres integration into Kafka
  • http://debezium.io/docs/connectors/postgresql/
  • https://www.simple.com/engineering/a-change-data-capture-pipeline-from-postgresql-to-kafka
  • https://www.slideshare.net/JeffKlukas/postgresql-kafka-the-delight-of-change-data-capture
  • https://blog.insightdatascience.com/from-postgresql-to-redshift-with-kafka-connect-111c44954a6a
• Streaming ETL
  • Embrace the Anarchy: Apache Kafka's Role in Modern Data Architectures (Recording & Slides)
  • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL
  • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL (Recording & Slides)
  • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
  • https://github.com/confluentinc/ksql/
30. Resources
• CDC Spreadsheet
• Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
• #partner-engineering on Slack for questions
• BD team (#partners / partners@confluent.io) can help with introductions on a given sales op
#EOF