London Apache Kafka Meetup (Jan 2017)


Landoop presents how to simplify your ETL process using Kafka Connect for the (E) and the (L). Introducing KCQL, the Kafka Connect Query Language, and how it can simplify fast-data (ingress & egress) pipelines; how KCQL can be used to set up Kafka Connectors for popular in-memory and analytical systems, with live demos using Hazelcast, Redis and InfluxDB; how to get started with a fast-data Docker Kafka development environment; and how to enhance your existing Cloudera (Hadoop) clusters with fast-data capabilities.

  • Thank you very much for coming today. I will be delivering a talk about building Fast Data systems with Kafka.
  • My name is Antonios. I’ve been involved with open-source projects on distributed systems in the Hadoop ecosystem, and currently Apache Kafka has my heart :)

    I have authored a book on MapReduce using Scalding and co-authored another one.
  • Landoop is a start-up focusing on DevOps, Distributed Systems and particularly Apache Kafka.
  • Today I’d like to start the presentation with Kafka Connect. I guess most of you are already familiar with it, so I will give a quick overview.
  • Kafka Connect was introduced almost one year ago, as a feature of Apache Kafka 0.9+ with the narrow (although very important) scope of copying streaming data from and to a Kafka cluster. I found the concept really interesting and decided to experiment with it to see what this framework introduces.

    Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale.

    The announcement by Confluent
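As a sketch of how the framework is used in practice, a connector is typically submitted to a distributed Connect worker over its REST API; the connector name, class and topic below are hypothetical examples, and 8083 is the default distributed-worker port:

```shell
# Submit a hypothetical connector to a Kafka Connect worker's REST API.
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "example-sink",
    "config": {
      "connector.class": "com.example.ExampleSinkConnector",
      "tasks.max": "1",
      "topics": "example-topic"
    }
  }'
```

The worker persists the configuration and distributes the tasks across the cluster, which is what makes the distributed, fault-tolerant runtime possible.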
  • And this is how Kafka Connect fits into the picture on a Kafka based system.
    You would normally use a stream processing framework, e.g. Spark or Kafka Streams, to transform your data streams.
  • What Kafka Connect offers is separation of concerns. It can simplify the key stages of the ETL process, and using simple tools we can build and maintain distributed streaming data pipelines.

    The E (the extraction) and the L (the load) can be taken care of for you, and then as a developer you can focus on the T (the transformations).

    By combining Kafka Connect and stream processing engines we can perform streaming ETLs. Each does the job it is best at, and Kafka acts as the underlying data storage layer that supports them and allows simple integration with a variety of other applications.
  • By using a robust framework that delivers scalability and fault tolerance out-of-the-box, we can focus on extracting value in a transformation layer.

    Kafka Connect supports distributed deployments to deliver fault tolerance and automated load balancing.

    As you can see here, the basic pattern is to use Kafka Connect to perform Extraction of the data and load it into Kafka as a temporary, scalable, fault-tolerant streaming data store. While you can do this with other, more generic data copy tools, you’ll commonly lose important semantics such as at-least-once delivery of data. Once the data is extracted, you use stream processing engines to perform Transformation; either this is the endpoint for the data, or you can deliver it back into Kafka. Finally, Load the data with another Kafka connector into the destination system. Obviously this is a simplified picture: your pipeline will grow much more complex and have multiple stages of transformation, especially if the intermediate data is useful for reuse by multiple applications, including anything downstream that may not be handled by stream processing engines.
  • Most configurations are connector dependent, but there are a few settings common to all connectors
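As a rough sketch, the common settings combine into a configuration like the following; the connector class and all values here are illustrative assumptions, not taken from the slides:

```properties
# Hypothetical sink-connector configuration; class name and values
# are illustrative assumptions.
# connector's unique name
name=redis-sink-fx
# connector's java class
connector.class=com.datamountaineer.streamreactor.connect.redis.sink.RedisSinkConnector
# maximum tasks to create
tasks.max=3
# list of topics to sink data from
topics=yahooFX-topic
```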
  • What we are introducing to all our Kafka connectors is the KCQL query
  • Let’s look at some of the more advanced features of KCQL - and in particular regarding some sinks.
  • Hazelcast, for example, supports the Ring Buffer data structure, which became quite popular through the Disruptor pattern. Data can be pushed into a fixed-size buffer with a particular retention period. If the buffer fills up, an eviction policy is triggered to either evict the oldest records or deny the addition of new records.

    So to write some IoT data from a Kafka topic into a Ring Buffer - we can use the STOREAS keyword.

    On the right side, you can see how we can store the same data into a RELIABLE TOPIC - another Hazelcast data structure.

    *Hazelcast requires data to be serializable; JSON and Avro are supported.
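Putting these notes together, the two sinks can be expressed in KCQL along these lines. The topic and field names follow the IoT example later in the deck; the STOREAS RELIABLE_TOPIC keyword on the second query is an assumption based on the notes, since the slide shows only the target name:

```sql
-- Sink IoT readings from a Kafka topic into a Hazelcast ring buffer
INSERT INTO sensor_ringbuffer
SELECT sensor_id, temperature, ts
FROM coap_sensor_topic
STOREAS RING_BUFFER

-- Sink the same readings into a Hazelcast reliable topic
-- (keyword assumed, not shown on the slide)
INSERT INTO sensor_reliabletopic
SELECT sensor_id, temperature, ts
FROM coap_sensor_topic
STOREAS RELIABLE_TOPIC
```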
  • Redis provides the Sorted Set data structure. This structure allows only unique elements to be added, and each element must be scored to enforce ordering.

    This data structure is often used to preserve time-series data, as Redis allows running time-range queries.

    So if we have a Kafka topic with Foreign Exchange data, we can either:
    - store all the messages into a single SortedSet (the one in blue), OR
    - create a new SortedSet for each symbol (one SortedSet per currency rate) using the PK syntax on the right
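In KCQL, the two options read as follows (queries taken from the FX example later in the deck):

```sql
-- One sorted set for all FX messages, scored by the ts field
INSERT INTO FXSortedSet
SELECT symbol, price
FROM yahooFX-topic
STOREAS SortedSet(score=ts)

-- One sorted set per currency symbol, keyed via the PK clause
SELECT price
FROM yahooFX-topic
PK symbol
STOREAS SortedSet(score=ts)
```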
  • So this is a list of Apache 2.0 licensed Kafka Connectors that we have been working on.

    Blockchain, Bloomberg, the Cassandra connector that is certified by DataStax, a Constrained Application Protocol connector, Elasticsearch, JMS, MQTT and others are some of the connectors already available, released against the two latest releases of Apache Kafka.
  • So let’s see a DEMO in real-time
  • Connectors look simple overall, and I know a number of people in this room are already using them in production. So what does performance look like?

    The image above demonstrates that, depending on the sink system, we can sink 50K records/sec using:
    20 partitions
    3 connect tasks
    5 GB RAM / connector
    less than 2 CPUs

    In the bottom-left corner, we can see that we have saturated 50% of the available network bandwidth.

    Depending on the number of tasks and partitions, we can easily increase sink performance to more than 100K records/sec.

    The lessons regarding performance are that:
    - Kafka Connect can scale really well
    - it requires quite some memory
    - and quite some CPU, especially when batching writes
  • We have also sent Pull Requests to the Prometheus team to enable GZIP compression and minimise any impact on the running system, something that has significantly decreased the network I/O.

    We then provide pre-built dashboards in Grafana.
    We are using Grafana 4.0, released a few months ago, which adds alerting - a truly revolutionary feature, as it transforms Grafana from a visualisation tool into a mission-critical monitoring tool.

    We’ll have a demo, but before going into it ..
  • Before doing a live presentation, I’d like to answer a question:

    How do I ship such a complex infrastructure, one that can easily grow into hundreds of running services?

    We preferably use:
    - deployment apps such as Ansible
    - Docker-based technologies for stateless micro-services
    - CDH-based integration with Cloudera Manager for CDH Hadoop clusters
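For the demo environment specifically, everything ships in a single Docker container. The command below is the one shown on the demo slide; the ADV_HOST value is a placeholder assumption you would replace with the address clients use to reach the box:

```shell
# Start a self-contained Kafka development environment:
# ZooKeeper, Kafka broker, Schema Registry, Kafka Connect (distributed),
# Kafka REST Proxy, plus the Landoop web tools on port 3030.
docker run --rm -it \
  -p 2181:2181 -p 3030:3030 -p 8081:8081 \
  -p 8082:8082 -p 8083:8083 -p 9092:9092 \
  -e ADV_HOST=127.0.0.1 \
  landoop/fast-data-dev
```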
  • CDH docs -
  • More connectors are added monthly
  • Time-Travel in Kafka topics, and KCQL queries in real-time
  • London Apache Kafka Meetup (Jan 2017)

    1. Delivering Fast Data Systems with Kafka LANDOOP Antonios Chalkiopoulos 18/1/2017
    2. @chalkiopoulos Open Source contributor Big Data projects in Media, Betting, Retail and Investment Banks in London Books Author, Programming MapReduce with Scalding Founder of Landoop
    3. DevOps Big Data Scala Automation Distributed Systems Monitoring Hadoop Fast Data / Streams Kafka
    4. KAFKA CONNECT a bit of context
    5. KAFKA CONNECT “a common framework for allowing stream data flow between kafka and other systems”
    6. Data is produced from a source and consumed to a sink: Data Source → Kafka Connect → KAFKA → Kafka Connect → Data Sink (optionally with Stream processing in between)
    7. Data Source → Kafka Connect (E) → KAFKA → Stream processing (T) → Kafka Connect (L) → Data Sink
    8. Developers don’t care about: moving data to/from sink/source, delivery semantics, offset management, serialization/de-serialization, partitioning/scalability, fault tolerance/fail-over, Schema Registry integration. Developers care about: domain-specific transformations
    9. CONNECTORS Kafka Connect’s framework allows developers to create connectors that copy data to/from other systems just by writing configuration files and submitting them to Connect, with no code necessary
    10. Connector configurations are key-value mappings: name (connector’s unique name), connector.class (connector’s java class), tasks.max (maximum tasks to create), topics (list of topics to source or sink data)
    11. Introducing a query language for the connectors: the same key-value mappings, plus query (a KCQL query that specifies fields/actions for the target system)
    12. KCQL (Kafka Connect Query Language) is a SQL-like syntax allowing streamlined configuration of Kafka Sink Connectors, and then some more. Example: project fields, rename or ignore them, and further customise in plain text:
       INSERT INTO transactions SELECT field1 AS column1, field2 AS column2, field3 FROM TransactionTopic;
       INSERT INTO audits SELECT * FROM AuditsTopic;
       INSERT INTO logs SELECT * FROM LogsTopic AUTOEVOLVE;
       INSERT INTO invoices SELECT * FROM InvoiceTopic PK invoiceID;
    13. KCQL: how does it look? While integrating Kafka with in-memory data grid, key-value, document store, NoSQL, search etc. systems:
       INSERT INTO $TARGET
       SELECT *|columns (i.e. col1,col2 | col1 AS column1,col2)
       FROM $TOPIC_NAME
       [ IGNORE columns ]
       [ AUTOCREATE ]
       [ PK columns ]
       [ AUTOEVOLVE ]
       [ BATCH = N ]
       [ CAPITALIZE ]
       [ INITIALIZE ]
       [ PARTITIONBY cola[,colb] ]
       [ DISTRIBUTEBY cola[,colb] ]
       [ CLUSTERBY cola[,colb] ]
       [ TIMESTAMP cola|sys_current ]
       [ STOREAS $YOUR_TYPE([key=value, .....]) ]
       [ WITHFORMAT TEXT|AVRO|JSON|BINARY|OBJECT|MAP ]
    14. Why KCQL? Topic to target mapping, field selection, auto creation, auto evolution, error policies, multiple KCQLs / topic, field extraction, access to Key & Metadata
    15. KCQL Advanced Features Examples
    16. KCQL | Example Kafka topic with IoT data:
       { "sensor_id": "01", "temperature": 52.7943, "ts": 1484648810 }
       { "sensor_id": "02", "temperature": 28.8597, "ts": 1484648810 }
       INSERT INTO sensor_ringbuffer SELECT sensor_id, temperature, ts FROM coap_sensor_topic STOREAS RING_BUFFER
       INSERT INTO sensor_reliabletopic SELECT sensor_id, temperature, ts FROM coap_sensor_topic
    17. KCQL | Example Kafka topic with FX data:
       { "symbol": "USDGBP", "price": 0.7943, "ts": 1484648810 }
       { "symbol": "EURGBP", "price": 0.8597, "ts": 1484648810 }
       INSERT INTO FXSortedSet SELECT symbol, price FROM yahooFX-topic STOREAS SortedSet(score=ts)
       SELECT price FROM yahooFX-topic PK symbol STOREAS SortedSet(score=ts)
       Sorted Set -> { value : score }, e.g. B:1 A:2 D:3 C:20
    18. Stream Reactor connectors support KCQL: kafka-connect-blockchain, kafka-connect-bloomberg, kafka-connect-cassandra, kafka-connect-coap, kafka-connect-druid, kafka-connect-elastic, kafka-connect-ftp, kafka-connect-hazelcast, kafka-connect-hbase, kafka-connect-influxdb, kafka-connect-jms, kafka-connect-kudu, kafka-connect-mongodb, kafka-connect-mqtt, kafka-connect-redis, kafka-connect-rethink, kafka-connect-voltdb, kafka-connect-yahoo. Source: Integration Tests:
    19. DEMO Kafka Connect InfluxDB. We’ll need: Zookeeper, Kafka Broker, Schema Registry, Kafka Connect Distributed, Kafka REST Proxy. We’ll also use: StreamReactor connectors, Landoop Fast Data Web Tools.
       docker run --rm -it -p 2181:2181 -p 3030:3030 -p 8081:8081 -p 8082:8082 -p 8083:8083 -p 9092:9092 -e ADV_HOST= landoop/fast-data-dev
       We’ll generate some Avro messages:
       case class DeviceMeasurements(deviceId: Int, temperature: Int, moreData: String, timestamp: Long)
    20. DEMO Kafka Development Environment @ Fast-data-dev docker image
    21. DEMO Integration testing with Coyote for connectors & infrastructure
    22. Schema Registry UI
    23. Kafka Topics UI
    24. Kafka Connect UI
    25. Connectors Performance
    26. Monitoring & Alerting via JMX
    27. How do I ship it? Deployment apps | Containers (mesos, kubernetes) | Hadoop integration. (* state-less apps, e.g. schema registry and kafka connect, are container-friendly.) Available features: Kafka ecosystem, StreamReactor Connectors, Landoop web tools, Monitoring & Alerting, Security features
    28. Wrap up: KCQL, Connectors, Kafka Web Tools, Automation & Integrations
    29. Coming up: Kafka backend enhanced UIs | Timetravel
    30. $ locate
    31. Thank you ;)