
Kafka Connect


Kafka Connect: scalable, fault-tolerant ETL for streams.



  1. Agenda: Me and DataMountaineer, Data Integration, Kafka Connect, Demo
  2. Andrew Stevenson, Lead Mountain Goat at DataMountaineer. Solution Architect: Fast Data, and Big Data as long as it's fast. Contributed to Kafka Connect, Sqoop, Kite SDK.
  3. Kafka Connect connectors: 20+ connectors, including the DataStax Certified Cassandra Sink. Professional services: implementations, architecture reviews. DevOps & tooling. Connector support.
  4. Partners
  5. Data Integration: loading and unloading data should be easy.
  6. But it takes too long (certainly on Hadoop).
  7. Why? Enterprise pipelines must consider:
     - Delivery semantics
     - Offset management
     - Serialization / de-serialization
     - Partitioning / scalability
     - Fault tolerance / failover
     - Data model integration
     - CI/CD
     - Metrics / monitoring
  8. Which results in? Multiple technologies:
     - Bash wrappers around Sqoop
     - Oozie XML
     - Custom Java/Scala/C#
     - Third party
     - Multiple teams hand-rolling similar solutions
     And a lack of separation of concerns: extract/loading ends up domain-specific.
  9. What we really care about… DOMAIN-SPECIFIC TRANSFORMATIONS. Focus on adding value.
  10. Kafka Connect?
      ✓ Delivery semantics
      ✓ Offset management
      ✓ Serialization / de-serialization
      ✓ Partitioning / scalability
      ✓ Fault tolerance / fail-over
      ✓ Data model integration
      ✓ Metrics
      Out of the box – ONE FRAMEWORK. Lets you focus on domain logic.
  11. Kafka Connect: “a common framework facilitating data streams between Kafka and other systems”
  12. Ease of use: deploy flows via configuration files, with no code necessary. Out-of-the-box and community connectors.
  13. Configurations are key-value mappings:
      name             the connector's unique name
      connector.class  the connector's class
      tasks.max        maximum number of tasks to create
      topics           list of topics (sinks only)
  14. Config Example
      name = kudu-sink
      connector.class = KuduSinkConnector
      tasks.max = 1
      topics = kudu_test
      connect.kudu.master = quickstart
      connect.kudu.sink.kcql = INSERT INTO KuduTable SELECT * FROM kudu_test
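      In distributed mode the same settings are typically submitted to the Connect REST API as JSON; a hedged sketch of the equivalent payload (the shape is the standard name/config envelope, the values are just the ones above):
      {
        "name": "kudu-sink",
        "config": {
          "connector.class": "KuduSinkConnector",
          "tasks.max": "1",
          "topics": "kudu_test",
          "connect.kudu.master": "quickstart",
          "connect.kudu.sink.kcql": "INSERT INTO KuduTable SELECT * FROM kudu_test"
        }
      }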
  15. KCQL is a SQL-like syntax allowing streamlined configuration of Kafka sink connectors, and then some more. Example: project fields, rename or ignore them, and further customise in plain text:
      INSERT INTO transactions SELECT field1 AS column1, field2 AS column2 FROM TransactionTopic;
      INSERT INTO audits SELECT * FROM AuditsTopic;
      INSERT INTO logs SELECT * FROM LogsTopic AUTOEVOLVE;
      INSERT INTO invoices SELECT * FROM InvoiceTopic PK invoiceID;
  16. KCQL
      { "sensor_id": "01", "temperature": 52.7943, "ts": 1484648810 }
      { "sensor_id": "02", "temperature": 28.8597, "ts": 1484648810 }
      INSERT INTO sensor_ringbuffer SELECT sensor_id, temperature, ts FROM coap_sensor_topic WITHFORMAT JSON STOREAS RING_BUFFER
      INSERT INTO sensor_reliabletopic SELECT sensor_id, temperature, ts FROM coap_sensor_topic WITHFORMAT AVRO STOREAS RELIABLE_TOPIC
  17. Deeper into Connect: Modes, Workers, Connectors, Tasks, Converters
  18. Modes
      Standalone – single node; for testing and one-off imports/exports
      Distributed – 1 or more workers on 1 or more servers form a cluster/consumer group
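      As a rough sketch, both modes are launched with the scripts that ship with Kafka (the properties file names here are only illustrative):
      bin/connect-standalone.sh worker.properties kudu-sink.properties
      bin/connect-distributed.sh worker-distributed.properties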
  19. Interaction
      Standalone – properties file at start up
      Distributed – REST API, driven here via the cli:
      create    ./cli create conn < conn.conf
      start     ./cli start conn < conn.conf
      stop      ./cli stop conn
      restart   ./cli restart conn
      pause     ./cli pause conn
      remove    ./cli rm conn
      status    ./cli ps
      plugins   ./cli plugins
      validate  ./cli validate < conn.conf
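      The cli above is DataMountaineer tooling; under the covers it drives the standard Connect REST API, which exposes roughly:
      GET    /connectors
      POST   /connectors                                  (JSON body: name + config)
      GET    /connectors/{name}/status
      PUT    /connectors/{name}/config                    (create or update)
      PUT    /connectors/{name}/pause
      PUT    /connectors/{name}/resume
      POST   /connectors/{name}/restart
      DELETE /connectors/{name}
      GET    /connector-plugins
      PUT    /connector-plugins/{class}/config/validate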
  20. Workers: every worker configured with the same group.id (e.g. group.id = cluster1) joins the same Connect cluster, which is backed by a Kafka consumer group. [diagram: three workers sharing group.id = cluster1 on top of Kafka]
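      A hedged sketch of the distributed worker properties behind that picture (broker address and topic names are illustrative):
      bootstrap.servers = broker1:9092
      group.id = cluster1
      config.storage.topic = connect-configs
      offset.storage.topic = connect-offsets
      status.storage.topic = connect-status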
  21. Connectors define the how:
      - which plugin to use; it must be on the CLASSPATH
      - break the work up into tasks
      - multiple per cluster, each with a unique name
      - fault tolerant
  22. Happy flow: a connector config (C1) is PUT to the REST API on a worker, written to the config topic, and the coordinator spreads C1's tasks (T1, T2, T3) across workers W1–W3. [diagram: REST API → config topic → coordinator → W1/W2/W3]
  23. Unhappy flow: when a worker is lost, the rebalance listener kicks in and C1's tasks are redistributed across the remaining workers. [diagram: W1–W3, config topic, ReBalanceListener]
  24. Connector API
      class CoherenceSinkConnector extends SinkConnector {
        override def taskClass(): Class[_ <: Task]
        override def start(props: util.Map[String, String]): Unit
        override def taskConfigs(maxTasks: Int): util.List[util.Map[String, String]]
        override def stop(): Unit
      }
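      To make that concrete, a minimal hedged sketch of an implementation, assuming the standard Connect Connector interface (which also requires version() and config()). ExampleSinkConnector/ExampleSinkTask are illustrative names, not the real Coherence connector; the task class is sketched after the Task API slide:

      import java.util
      import scala.collection.JavaConverters._
      import org.apache.kafka.common.config.ConfigDef
      import org.apache.kafka.connect.connector.Task
      import org.apache.kafka.connect.sink.SinkConnector

      class ExampleSinkConnector extends SinkConnector {
        // configuration handed to us by the framework at start up
        private var configProps: util.Map[String, String] = _

        override def start(props: util.Map[String, String]): Unit =
          configProps = props

        // which Task class the framework should instantiate
        override def taskClass(): Class[_ <: Task] = classOf[ExampleSinkTask]

        // fan the same config out to at most maxTasks tasks;
        // the framework then spreads those tasks over the workers
        override def taskConfigs(maxTasks: Int): util.List[util.Map[String, String]] =
          (1 to maxTasks).map(_ => configProps).toList.asJava

        override def stop(): Unit = ()

        override def config(): ConfigDef = new ConfigDef()

        override def version(): String = "0.1"
      }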
  25. Tasks perform the actual work:
      - loading / unloading
      - single threaded
      Used to scale:
      - more tasks, more parallelism
      - managed via the Kafka consumer group
      - if (tasks > partitions) => idle tasks
      Tasks contain no state – it lives in Kafka – and can be started, stopped, restarted and paused.
  26. Task API
      class CoherenceSinkTask extends SinkTask {
        override def start(props: util.Map[String, String]): Unit
        override def stop(): Unit
        override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata])
        override def put(records: util.Collection[SinkRecord])
      }
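      A matching hedged sketch of the task side (again illustrative, not the real Coherence sink; it only logs each record where a real sink would write to the target system):

      import java.util
      import scala.collection.JavaConverters._
      import org.apache.kafka.clients.consumer.OffsetAndMetadata
      import org.apache.kafka.common.TopicPartition
      import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

      class ExampleSinkTask extends SinkTask {
        // open connections to the target system here
        override def start(props: util.Map[String, String]): Unit = ()

        // called with a batch of records from the partitions assigned to this task
        override def put(records: util.Collection[SinkRecord]): Unit =
          records.asScala.foreach { r =>
            // a real sink would write r.value() to the external system
            println(s"${r.topic()}-${r.kafkaPartition()}@${r.kafkaOffset()}: ${r.value()}")
          }

        // flush any buffered writes; Connect then commits these offsets back to Kafka
        override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = ()

        // close connections here
        override def stop(): Unit = ()

        override def version(): String = "0.1"
      }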
  27. Converters
      JsonConverter – ships with Kafka
      AvroConverter – ships with Confluent; integrates with the Schema Registry
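      Converters are selected in the worker properties; a hedged sketch, assuming a Schema Registry on localhost:
      key.converter = org.apache.kafka.connect.json.JsonConverter
      value.converter = io.confluent.connect.avro.AvroConverter
      value.converter.schema.registry.url = http://localhost:8081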
  28. kafka-connect-blockchain, kafka-connect-bloomberg, kafka-connect-cassandra, kafka-connect-coap, kafka-connect-druid, kafka-connect-elastic, kafka-connect-ftp, kafka-connect-hazelcast, kafka-connect-hbase, kafka-connect-influxdb, kafka-connect-jms, kafka-connect-kudu, kafka-connect-mongodb, kafka-connect-mqtt, kafka-connect-redis, kafka-connect-rethink, kafka-connect-voltdb, kafka-connect-yahoo
      Source: https://github.com/datamountaineer/stream-reactor
      Integration tests: http://coyote.landoop.com/connect/
  29. Connect UI, Schema UI, Topic UI, Dockers, Cloudera CSD
  30. Kafka Connect: scalable, fault tolerant, a common framework. Does the hard work for you.
  31. DEMO
