
Data integration with Apache Kafka


A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.

Published in: Technology

  1. [Confidential] Streaming Data Integration with Apache Kafka. Presented by: David Tucker | Dir. Partner Engineering
  2. Today’s Discussion
      • The evolving world of data integration
      • Design Considerations
      • The Kafka Solution
      • Kafka Connect
      • Logical Architecture
      • Core Components and Execution Model
      • Connector Examples
      • Wrap Up and Questions
  8. Explosion of Operational Data Stores and Processing Frameworks
  9. Abstract View: Many Ad Hoc Pipelines
      [Diagram labels: Search, Security, Fraud Detection, Application, User Tracking, Operational Logs, Operational Metrics, Hadoop, App, Data Warehouse, Espresso, Cassandra, Oracle, Databases, Storage, Interfaces, Monitoring]
  10. Re-imagined Architecture: Streaming Platform with Kafka
      ✓ Distributed
      ✓ Fault Tolerant
      ✓ Stores Messages
      ✓ Processes Streams (Kafka Streams)
      [Diagram labels: Search, Security, Fraud Detection, Application, User Tracking, Operational Logs, Operational Metrics, Espresso, Cassandra, Oracle, Hadoop, App, Monitoring, Data Warehouse, Kafka]
  11. Design Considerations: These Things Matter
      • Reliability and Delivery Semantics: losing data is (usually) not OK.
          • Exactly Once vs At Least Once vs (very rarely) At Most Once
      • Timeliness
          • Push vs Pull
      • High Throughput, Varying Throughput
          • Compression, Parallelism, Back Pressure
      • Data Formats
          • Flexibility, Structure
      • Security
      • Error Handling
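
The difference between at-least-once and at-most-once delivery on the Design Considerations slide comes down to whether a consumer records its position before or after it processes a message. The following is a minimal pure-Python simulation of that idea, not Kafka client code; the function name `run` and its parameters are hypothetical and exist only for illustration.

```python
def run(messages, commit_first, crash_at):
    """Simulate a consumer that crashes once while handling messages[crash_at].

    commit_first=True  models at-most-once  (commit offset, then process).
    commit_first=False models at-least-once (process, then commit offset).
    """
    committed = 0       # offset of the next message to read after a restart
    delivered = []      # what the downstream system actually received
    crashed = False     # the simulated crash happens only once
    while committed < len(messages):
        offset = committed
        try:
            if commit_first:
                committed = offset + 1
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash after commit, before processing")
                delivered.append(messages[offset])
            else:
                delivered.append(messages[offset])
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash after processing, before commit")
                committed = offset + 1
        except RuntimeError:
            continue    # "restart": resume from the last committed offset
    return delivered


at_most_once = run(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run(["a", "b", "c"], commit_first=False, crash_at=1)
print(at_most_once)   # message "b" is lost
print(at_least_once)  # message "b" is delivered twice
```

In this sketch, at-most-once trades a possible loss for no duplicates, while at-least-once trades a possible duplicate for no loss; exactly-once additionally requires the processing and the offset commit to succeed or fail together.
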
  14. Introducing Kafka Connect: Simplified, scalable data integration via Apache Kafka
  16. Kafka Connect: Separation of Concerns
  17. Kafka Connect: Logical Model
      [Diagram labels: Kafka Connect, Apache Kafka Brokers, Schema Registry]
  18. How is Connect different from a producer or consumer?
      • Producers and consumers enable total flexibility; data is published and processed in any way.
      • This flexibility means you do everything yourself.
      • Kafka Connect’s simple framework allows:
          • developers to create connectors that copy data to/from other systems
          • operators/users to use those connectors just by writing configuration files and submitting them to Connect; no code necessary
          • community and 3rd-party engineers to build reliable plugins for common data sources and sinks
          • deployments to deliver scalability, fault tolerance and automated load balancing out of the box
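
To make "writing configuration files and submitting them to Connect" concrete, here is a minimal sketch that builds a request for the Kafka Connect REST API (`POST /connectors`). It uses the FileStreamSource example connector from Apache Kafka; the worker URL, connector name, file path, and topic are hypothetical values, and the helper `build_connector_request` is not part of any library. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_connector_request(connect_url, name, config):
    """Build (but do not send) a POST /connectors request for a Connect worker."""
    payload = {"name": name, "config": config}
    return urllib.request.Request(
        url=connect_url.rstrip("/") + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: the file path, topic, and connector name are placeholders.
file_source_config = {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/example-input.txt",
    "topic": "example-topic",
}

# Assumes a Connect worker listening on localhost:8083 (the default REST port).
request = build_connector_request(
    "http://localhost:8083", "example-file-source", file_source_config
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a worker actually running at that address, `urllib.request.urlopen(request)` would submit the connector; everything the operator supplies is the configuration dictionary.
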
  27. Connector Hub:
      • Confluent-supported connectors (included in CP)
      • Partner/Community-written connectors (just a sampling)
      [Diagram label: JDBC]
  28. Kafka Connect Example: MySQL to Hive Pipeline
      • Blog at
  29. MySQL to Hive Pipeline: Step by Step
      • Configure the JDBC Source Connector with the MySQL details
          • User authentication
          • Tables to replicate; polling interval for change-data-capture
      • Configure the HDFS Sink Connector with Hadoop details
          • Target HDFS directory
          • Hive metastore details
          • Partitioning details (optional)
      • Watch it go!
      • What you can’t see
          • Source and Sink scalability
          • Table metadata changes are captured in Schema Registry
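
The two configuration steps above can be sketched as a pair of connector configurations. This is an illustrative sketch, not the configuration used in the talk: the property names follow the Confluent JDBC source and HDFS sink connectors, while every value (hostnames, credentials, table names, the `id` column, the `mysql-` topic prefix, the flush size) is a hypothetical placeholder. The optional partitioning settings are left out.

```python
import json


def jdbc_source_config(jdbc_url, user, password, tables, poll_ms, prefix):
    """JDBC source settings: authentication, tables to replicate, polling interval."""
    return {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "1",
        "connection.url": jdbc_url,
        "connection.user": user,
        "connection.password": password,
        "table.whitelist": ",".join(tables),
        "mode": "incrementing",            # detect new rows via a growing key
        "incrementing.column.name": "id",  # assumed key column, hypothetical
        "poll.interval.ms": str(poll_ms),
        "topic.prefix": prefix,            # topic name = prefix + table name
    }


def hdfs_sink_config(topics, hdfs_url, metastore_uri):
    """HDFS sink settings: target directory and Hive metastore details."""
    return {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "1",
        "topics": ",".join(topics),
        "hdfs.url": hdfs_url,
        "topics.dir": "topics",            # target HDFS directory
        "flush.size": "1000",              # records per file, placeholder value
        "hive.integration": "true",
        "hive.metastore.uris": metastore_uri,
        "schema.compatibility": "BACKWARD",
    }


tables = ["users", "orders"]               # hypothetical table names
source = jdbc_source_config(
    "jdbc:mysql://mysql.example.com:3306/shop", "connect", "secret",
    tables, poll_ms=5000, prefix="mysql-",
)
sink = hdfs_sink_config(
    [source["topic.prefix"] + t for t in tables],
    "hdfs://namenode.example.com:8020",
    "thrift://metastore.example.com:9083",
)
print(json.dumps({"source": source, "sink": sink}, indent=2))
```

The only coupling between the two connectors is the topic names: the sink subscribes to the topics the source writes, so either side can be scaled or replaced independently.
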
  31. Thank You. Questions?