
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL

Speaker: Robin Moffatt, Developer Advocate, Confluent

In this talk, we'll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API and KSQL. We'll stream data in from MySQL, transform it with KSQL and stream it out to Elasticsearch. Options for integrating databases with Kafka using CDC and Kafka Connect will be covered as well.

This is part 2 of 3 in Streaming ETL - The New Data Integration series.

Watch the recording: https://videos.confluent.io/watch/4cVXUQ2jCLgJNmg4kjCRqo?.


  1. Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL. Robin Moffatt, Developer Advocate (@rmoff / robin@confluent.io)
  2. $ whoami • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director & Dev Champion • Blogging: https://rmoff.net & http://cnfl.io/rmoff • Twitter: @rmoff
  3. Housekeeping Items ● This session will last about an hour. ● This session will be recorded. ● You can submit your questions by entering them into the GoToWebinar panel. ● The last 10-15 minutes will consist of Q&A. ● The slides and recording will be available after the talk.
  4. Streaming ETL with Apache Kafka and KSQL
  5. Database offload: RDBMS → Hadoop / Object Storage / Cloud DW (HDFS / S3 / BigQuery etc.) for analytics
  6. Streaming ETL with Apache Kafka and KSQL (diagram: RDBMS tables order items, customer, and customer orders flowing through stream processing)
  7. Real-time Event Stream Enrichment with Apache Kafka and KSQL (diagram: order events and customer data from an RDBMS joined via stream processing into a customer orders stream)
  8. Transform Once, Use Many (diagram: the enriched customer orders stream feeding a new app)
  9. Transform Once, Use Many (diagram: the same enriched stream also feeding HDFS / S3 / etc.)
  10. (image-only slide)
  11. (image-only slide)
  12. The Connect API of Apache Kafka®: reliable and scalable integration of Kafka with other systems, no coding required. ✓ Fault tolerant and automatically load balanced ✓ Extensible API ✓ Single Message Transforms ✓ Part of Apache Kafka, included in Confluent Open Source ✓ Centralized management and configuration ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3, syslog ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema. Example config: { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/
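A connector like the JDBC source above is created by POSTing its config to Kafka Connect's REST API. As a minimal sketch (the connector name `jdbc-mysql-source` is a hypothetical example; the config values are the ones from the slide):

```python
import json

def connector_request(name, config):
    """Build the JSON payload that Kafka Connect's REST API
    (POST /connectors) expects: a connector name plus its config map."""
    return json.dumps({"name": name, "config": config})

# The JDBC source config from the slide, wrapped for the REST API.
payload = connector_request(
    "jdbc-mysql-source",  # hypothetical connector name
    {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
        "table.whitelist": "sales,orders,customers",
    },
)
# Submit with e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data "$payload" http://localhost:8083/connectors
```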
  13. Kafka Connect (diagram: Kafka Connect workers and tasks moving data between Kafka brokers and external sources/sinks such as Amazon S3, syslog, flat files, CSV, JSON, MQTT)
  14. Considerations for Integration into Apache Kafka (photo by Matthew Smith on Unsplash) • Chucking data over the fence into a Kafka topic is not enough • We need standard ways of building data pipelines in Kafka • Schema handling • Serialisation formats
  15. Considerations for Integration into Apache Kafka • Confluent Schema Registry & Avro is a great way to do this • Downstream users can then easily use the data: • KSQL • Kafka Connect • Kafka Streams • Custom apps
  16. The Confluent Schema Registry (diagram: Kafka Connect reads from MySQL, registers the Avro schema with the Schema Registry, and writes Avro messages through Kafka; Kafka Connect on the other side delivers them to Elasticsearch)
  17. The Confluent Schema Registry: the source (MySQL) schema is preserved, and the target (Elasticsearch) schema mapping is automagically built
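The mechanism that makes this work is the Schema Registry wire format: each serialized message carries a magic byte and the registered schema's ID, so any consumer can fetch the right schema to decode the Avro payload. A minimal sketch of framing and unframing a message:

```python
import struct

def frame_message(schema_id: int, avro_bytes: bytes) -> bytes:
    # Confluent wire format: magic byte 0x0, then the schema ID as a
    # big-endian 4-byte integer, then the Avro-encoded payload.
    return struct.pack(">bI", 0, schema_id) + avro_bytes

def read_schema_id(message: bytes) -> int:
    # A consumer reads the 5-byte header to learn which schema to
    # fetch from the Schema Registry before decoding the payload.
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == 0, "not a Schema Registry framed message"
    return schema_id
```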
  18. Integrating Databases with Kafka • CDC is a generic term for capturing changing data, typically from an RDBMS • Two general approaches: query-based CDC and log-based CDC • There are other options, including hacks with triggers, Flashback etc., but these are system- and/or technology-specific • Read more: http://cnfl.io/kafka-cdc
  19. Query-based CDC • Use a database query to identify new & changed rows: SELECT * FROM my_table WHERE col > <value of col last time we polled> • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or a bespoke SQL query • Incremental ingest driven by an incrementing ID column and/or a timestamp column • Read more: http://cnfl.io/kafka-cdc
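The incremental-ingest loop above can be sketched in a few lines. This is an illustration of the polling pattern, not the JDBC connector's actual code; the table and column names are made up, and SQLite stands in for MySQL:

```python
import sqlite3

def poll_new_rows(conn, last_id):
    """Query-based CDC: fetch only rows whose incrementing ID is
    greater than the highest one seen on the previous poll."""
    rows = conn.execute(
        "SELECT id, item FROM my_table WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO my_table (item) VALUES (?)", [("a",), ("b",)])

rows, last_id = poll_new_rows(conn, 0)            # first poll sees both rows
conn.execute("INSERT INTO my_table (item) VALUES ('c')")
new_rows, last_id = poll_new_rows(conn, last_id)  # second poll sees only 'c'
```

Note how the pattern's limitations (covered on slide 21) fall out directly: a DELETE never matches `id > last_id`, and an UPDATE to an already-seen row is invisible unless a timestamp column is also tracked.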
  20. Log-based CDC • Use the database's transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more on this later…) • Read more: http://cnfl.io/kafka-cdc
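Log-based CDC tools such as Debezium (introduced on slide 23) emit one change event per log entry, with `before`/`after` row images and an operation code. As a simplified sketch of consuming such events (the envelope is modeled on Debezium's format, but trimmed; field values are invented):

```python
def apply_change(state, event):
    """Fold a Debezium-style change event into an in-memory copy of
    the table, keyed by id. op codes: c=create, u=update, d=delete,
    r=snapshot read."""
    payload = event["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):
        row = payload["after"]          # new row image
        state[row["id"]] = row
    elif op == "d":
        state.pop(payload["before"]["id"], None)  # 'after' is null on delete
    return state

state = {}
apply_change(state, {"payload": {"op": "c",
                                 "after": {"id": 1, "name": "alice"}}})
apply_change(state, {"payload": {"op": "d",
                                 "before": {"id": 1, "name": "alice"},
                                 "after": None}})
```

Unlike the query-based approach, the delete above is observed, because it appears in the transaction log like any other change.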
  21. Query-based vs Log-based CDC (photo by Matese Fields on Unsplash) • Query-based: + Usually easier to set up, and requires fewer permissions - Needs specific columns in the source schema - Impact of polling the DB (or a higher-latency tradeoff) - Can't track deletes • Read more: http://cnfl.io/kafka-cdc
  22. Query-based vs Log-based CDC (photo by Sebastian Pociecha on Unsplash) • Log-based: + Greater data fidelity + Lower latency + Lower impact on source - More setup steps - Higher system privileges required - For proprietary databases, usually $$$ • Read more: http://cnfl.io/kafka-cdc
  23. Which Log-Based CDC Tool? (For query-based CDC, use the Confluent Kafka Connect JDBC connector.) • Open source RDBMS, e.g. MySQL, PostgreSQL: Debezium (+ paid options) • Mainframe, e.g. VSAM, IMS: Attunity, SQData • Proprietary RDBMS, e.g. Oracle, MS SQL: Attunity, IBM InfoSphere Data Replication, Oracle GoldenGate, SQData, HVR • All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry • Read more: http://cnfl.io/kafka-cdc
  24. “But I need to join…aggregate…filter…”
  25. KSQL is a Declarative Stream Processing Language
  26. KSQL is the Streaming SQL Engine for Apache Kafka
  27. KSQL in Development and Production • Interactive KSQL (via REST) for development and testing: “Hmm, let me try out this idea...” • Headless KSQL for production, once the desired KSQL queries have been identified
  28. KSQL for Streaming ETL: joining, filtering, and aggregating streams of event data. CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum';
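What that stream-table join does can be sketched imperatively: KSQL keeps the latest state per key for the `users` table and looks each clickstream event's key up against it. A minimal in-memory analogue (all field values are invented for illustration):

```python
# Table side of the join: latest known state per user_id.
users = {
    "u1": {"level": "Platinum"},
    "u2": {"level": "Silver"},
}

def vip_actions(clickstream):
    """Enrich each clickstream event with the users table and keep
    only Platinum users, mirroring the CREATE STREAM query above."""
    for event in clickstream:
        user = users.get(event["userid"])         # LEFT JOIN ON userid
        if user and user["level"] == "Platinum":  # WHERE u.level = 'Platinum'
            yield {"userid": event["userid"],
                   "page": event["page"],
                   "action": event["action"]}

events = [
    {"userid": "u1", "page": "/home", "action": "click"},
    {"userid": "u2", "page": "/cart", "action": "click"},
]
```

The difference in KSQL is that both sides are continuously updating: new `users` records replace the table state, and the output is itself a Kafka topic.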
  29. KSQL for Anomaly Detection: identifying patterns or anomalies in real-time data, surfaced in milliseconds. CREATE TABLE possible_fraud AS SELECT card_number, count(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING count(*) > 3;
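The tumbling-window aggregation in that query can be sketched as bucketing each event into a fixed, non-overlapping 5-second window and counting per key. A rough batch analogue (sample timestamps and card numbers are invented; real KSQL maintains these counts incrementally as events arrive):

```python
from collections import Counter

def possible_fraud(attempts, window_seconds=5, threshold=3):
    """Tumbling-window count per card: flag cards with more than
    `threshold` attempts in any one window, as in the KSQL query."""
    counts = Counter()
    for ts, card in attempts:                    # (epoch seconds, card number)
        window_start = ts - (ts % window_seconds)  # TUMBLING (SIZE 5 SECONDS)
        counts[(window_start, card)] += 1          # GROUP BY card_number
    return {k: n for k, n in counts.items() if n > threshold}  # HAVING

attempts = [(0, "4111"), (1, "4111"), (2, "4111"), (3, "4111"), (7, "4111")]
flagged = possible_fraud(attempts)
# Four attempts land in the [0, 5) window, so that (window, card) pair
# is flagged; the attempt at t=7 starts a fresh window.
```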
  30. KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data. CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; • http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
  31. KSQL for Data Transformation: make simple derivations of existing topics from the command line. CREATE STREAM views_by_userid WITH (PARTITIONS=6, REPLICAS=5, VALUE_FORMAT='AVRO', TIMESTAMP='view_time') AS SELECT * FROM clickstream PARTITION BY user_id;
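`PARTITION BY user_id` rekeys every record so that the new key, not the old one, determines which of the 6 partitions it lands in. As a rough sketch of that routing (Kafka's default producer partitioner uses murmur2 on the key bytes; Python's built-in `hash()` stands in here purely for illustration):

```python
def partition_for(key: str, partitions: int = 6) -> int:
    """Route a record to a partition by hashing its key, so all
    records with the same key land in the same partition."""
    # Illustrative stand-in for Kafka's murmur2-based partitioner.
    return hash(key) % partitions

event = {"user_id": "u42", "view_time": 1528745640, "page": "/home"}
# After PARTITION BY user_id, the record is keyed by user_id, so every
# view by the same user is written to the same partition of the
# views_by_userid topic, which is what makes per-user joins and
# aggregations downstream possible.
p = partition_for(event["user_id"])
```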
  32. DEMO!
  33. (diagram: MySQL → Debezium / Kafka Connect and the Producer API into Kafka, then Kafka Connect out to Elasticsearch)
  34. Questions? http://confluent.io/ksql https://slackpass.io/confluentcommunity
