Landoop presents how to simplify your ETL process using Kafka Connect for the (E) and the (L). Introducing KCQL, the Kafka Connect Query Language, and how it can simplify fast-data (ingress & egress) pipelines; how KCQL can be used to set up Kafka connectors for popular in-memory and analytical systems, with live demos on Hazelcast, Redis and InfluxDB; how to get started with a fast-data Docker Kafka development environment; and how to enhance your existing Cloudera (Hadoop) clusters with fast-data capabilities.
London Apache Kafka Meetup (Jan 2017)
1. Delivering Fast Data Systems with Kafka
LANDOOP
www.landoop.com
Antonios Chalkiopoulos
18/1/2017
2. @chalkiopoulos
Open Source contributor
Big Data projects in Media, Betting, Retail and
Investment Banks in London
Books
Author, Programming MapReduce with Scalding
Founder of Landoop
3. DevOps Big Data Scala
Automation Distributed Systems Monitoring
Hadoop Fast Data / Streams Kafka
5. KAFKA CONNECT
“a common framework
for allowing stream data flow
between kafka and other systems”
6. Data is produced by a source and consumed by a sink.
Data Source → Kafka Connect → KAFKA → Kafka Connect → Data Sink
Data Source → Kafka Connect → KAFKA → Stream processing → KAFKA → Kafka Connect → Data Sink
8. Developers don’t care about:
Move data to/from sink/source
Support delivery semantics
Offset Management
Serialization / de-serialization
Partitioning / Scalability
Fault tolerance / fail-over
Schema Registry integration
Developers care about:
Domain specific transformations
9. CONNECTORS
Kafka Connect’s framework allows developers to create connectors that
copy data to/from other systems just by writing configuration files and
submitting them to Connect with no code necessary
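Submitting a configuration to Connect happens over its REST interface (a Connect worker listens on port 8083 by default, with a POST /connectors endpoint). A minimal sketch of that submission, where the connector class and topic names are purely illustrative:

```python
import json
import urllib.request

def build_connector_payload(name, config):
    """Build the JSON payload the Kafka Connect REST API expects:
    a connector name plus its key-value configuration."""
    return {"name": name, "config": config}

def submit_connector(connect_url, payload):
    """POST the connector configuration to a Connect worker."""
    req = urllib.request.Request(
        connect_url + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Hypothetical sink configuration -- class and topic names are illustrative.
payload = build_connector_payload(
    "redis-sink-example",
    {
        "connector.class": "com.example.RedisSinkConnector",  # hypothetical class
        "tasks.max": "2",
        "topics": "fx-topic",
    },
)
# submit_connector("http://localhost:8083", payload) would deploy it
```

No connector code is written here: the worker instantiates the class named in the configuration and runs it.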
10. Connector configurations are key-value mappings
name connector’s unique name
connector.class connector’s java class
tasks.max maximum tasks to create
topics list of topics (to source or sink data)
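Put together, the common settings above might look like this in a minimal sink connector properties file (the class and topic names are illustrative):

```properties
# A minimal sink connector configuration -- names are illustrative
name=redis-sink-example
connector.class=com.example.RedisSinkConnector
tasks.max=2
topics=fx-topic
```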
11. Introducing a query language for the connectors
name connector’s unique name
connector.class connector’s java class
tasks.max maximum tasks to create
topics list of topics (to source or sink data)
query KCQL query specifies fields/actions for the target system
12. KCQL
Kafka Connect Query Language
is a SQL-like syntax that streamlines the configuration of Kafka sink connectors, and more..
Example:
Project fields, rename or ignore them and further customise in plain text
INSERT INTO transactions SELECT field1 AS column1, field2 AS column2, field3 FROM TransactionTopic;
INSERT INTO audits SELECT * FROM AuditsTopic;
INSERT INTO logs SELECT * FROM LogsTopic AUTOEVOLVE;
INSERT INTO invoices SELECT * FROM InvoiceTopic PK invoiceID;
13. So while integrating Kafka with in-memory data grids, key-value stores, document stores, NoSQL, search and similar systems..
INSERT INTO $TARGET
SELECT *|columns(i.e col1,col2 | col1 AS column1,col2)
FROM $TOPIC_NAME
[ IGNORE columns ]
[ AUTOCREATE ]
[ PK columns ]
[ AUTOEVOLVE ]
[ BATCH = N ]
[ CAPITALIZE ]
[ INITIALIZE ]
[ PARTITIONBY cola[,colb] ]
[ DISTRIBUTEBY cola[,colb] ]
[ CLUSTERBY cola[,colb] ]
[ TIMESTAMP cola|sys_current ]
[ STOREAS $YOUR_TYPE([key=value, .....]) ]
[ WITHFORMAT TEXT|AVRO|JSON|BINARY|OBJECT|MAP ]
KCQL
What does it look like?
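A hypothetical query exercising several of the optional clauses from the grammar above (all topic, target and field names are invented for illustration):

```sql
-- Project and rename fields, key on sensorId, auto-create the target,
-- use the ts field as the timestamp and serialize as JSON
INSERT INTO metrics
SELECT sensorId AS id, temperature, ts
FROM iot-topic
PK sensorId
AUTOCREATE
TIMESTAMP ts
WITHFORMAT JSON
```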
14. Topic to target mapping
Field selection
Auto creation
Auto evolution
Error policies
Multiple KCQLs / topic
- Field extraction
- Access to Key & Metadata
Why KCQL ?
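Multiple KCQL statements per topic can be expressed by separating statements with semicolons in a single connector property. A sketch, where the property key and topic/target names are assumptions (each sink exposes its own KCQL setting):

```properties
# Hypothetical property key and names -- two statements route one topic
# into two target structures
connect.redis.kcql=INSERT INTO fx-all SELECT * FROM fx-topic;INSERT INTO fx-by-symbol SELECT symbol, rate FROM fx-topic PK symbol
```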
Thank you very much for coming today. I will be delivering a talk about building Fast Data systems with Kafka
My name is Antonios. I’ve been involved with open-source projects on distributed systems in the Hadoop ecosystem, and these days I hold Apache Kafka close to my heart :)
I have authored a book on MapReduce using Scalding and co-authored another one
Landoop is a start-up focusing on DevOps, distributed systems and particularly Apache Kafka
Today I’d like to start the presentation with Kafka Connect. I guess most of you are already familiar with it, so I will give a quick overview
Kafka Connect was introduced almost one year ago, as a feature of Apache Kafka 0.9+ with the narrow (although very important) scope of copying streaming data from and to a Kafka cluster. I found the concept really interesting and decided to experiment with it to see what this framework introduces.
Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale.
The announcement by confluent
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
And this is how Kafka Connect fits into the picture on a Kafka based system.
You would normally use a stream processing framework to transform your data streams i.e. Spark, K-Streams, etc
And what Kafka Connect offers is the separation of concerns. It can simplify the key stages of the ETL process, and using simple tools, we can build and maintain distributed streaming data pipelines.
The E (the extraction) and the L (the load) can be taken care of for you, so as a developer you can focus on the T (the transformations)
By combining Kafka Connect and stream processing engines we can perform streaming ETLs. Each does the job it is best at, and Kafka acts as the underlying data storage layer that supports them and allows simple integration with a variety of other applications.
By using a robust framework that delivers scalability and fault tolerance out-of-the-box we can then focus on extracting value in a transformation layer.
deployments to deliver fault tolerance and automated load balancing
As you can see here, the basic pattern is to use Kafka Connect to perform Extraction of the data and load it into Kafka as a temporary, scalable, fault tolerant streaming data store. While you can do this with other, more generic data copy tools, you’ll commonly lose important semantics such as at least once delivery of data. Once the data is extracted, you use stream processing engines to perform Transformation and either this is the endpoint for the data or you can deliver it back into Kafka. Finally, Load the data with another Kafka connector into the destination system. Obviously this is a simplified picture and your pipeline will grow much more complex, have multiple stages of transformation (especially if the intermediate data is useful for reuse by multiple applications, including anything downstream that may not be processed by stream processing engines).
Most configurations are connector dependent, but there are a few settings common to all connectors
What we are introducing to all our Kafka connectors is the KCQL query
Let’s look at some of the more advanced features of KCQL - and in particular regarding some sinks.
Hazelcast for example supports the Ring Buffer Data structure, which is quite popular from the Disruptor pattern. Data can be pushed in a fixed-size buffer, with a particular retention period. If the buffer gets filled - an eviction policy will be triggered - to either evict oldest records, or deny the addition of new records.
So to write some IoT data from a Kafka topic into a Ring Buffer - we can use the STOREAS keyword.
On the right side, you can see how we can store the same data into a RELIABLE TOPIC, another Hazelcast data structure.
*Hazelcast requires data to be serializable, and JSON and Avro are supported.
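Sketched as KCQL, the two variants described above might look like this (topic, target and STOREAS identifiers are assumptions for illustration):

```sql
-- Ring buffer sink for IoT data:
INSERT INTO sensor-ringbuffer SELECT * FROM iot-topic STOREAS RING_BUFFER WITHFORMAT JSON
-- The same data into a reliable topic, serialized as Avro:
INSERT INTO sensor-reliable SELECT * FROM iot-topic STOREAS RELIABLE_TOPIC WITHFORMAT AVRO
```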
Redis provides the Sorted Set data structure. This structure allows only unique elements to be added - and each element is required to be scored - to enforce ordering.
This data structure is often used to preserve time-series data, as Redis allows running time-range queries.
So if we have a Kafka topic with foreign-exchange data, we can either:
- store all the messages in a single SortedSet (the one in blue), or
- create a new SortedSet for each symbol (one SortedSet per currency rate) using the PK syntax on the right
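As KCQL, the two options might be sketched as follows, assuming a field `symbol` holds the currency pair and `ts` the event time (both names invented for illustration):

```sql
-- One SortedSet holding all FX messages, scored by timestamp:
INSERT INTO fx-all SELECT * FROM fx-topic STOREAS SortedSet(score=ts)
-- One SortedSet per symbol, via the PK syntax:
SELECT * FROM fx-topic PK symbol STOREAS SortedSet(score=ts)
```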
So this is a list of Apache 2.0 licensed Kafka Connectors that we have been working on.
Blockchain, Bloomberg, the Cassandra connector that is certified by DataStax, a Constrained Application Protocol connector, Elastic Search, JMS, MQTT and others are some of the connectors already available, and released against the 2 latest releases of Apache Kafka.
https://github.com/Landoop/fast-data-dev
So let’s see a DEMO in real time:
http://fast-data-dev.demo.landoop.com
https://coyote.landoop.com/connect/
http://schema-registry-ui.landoop.com
http://kafka-topics-ui.landoop.com
http://kafka-connect-ui.landoop.com
Connectors look simple overall, and I know a number of people in this room are already using them in production. So what does performance look like?
The image above demonstrates that, depending on the sink system, we can sink 50K records/sec using:
20 partitions
3 connect tasks
5 GB RAM / connector
less than 2 CPUs
On the bottom-left corner - we can see that we have saturated 50% of the available network bandwidth.
Depending on the number of tasks and partitions - we can easily increase sink performance to more than 100K records / sec.
The lessons regarding performance are that:
Kafka Connect can scale really well
It requires a fair amount of memory
and a fair number of CPUs, especially when batching writes
We have also sent pull requests to the Prometheus team to enable GZIP compression and minimise the impact on the running system, which has significantly decreased network I/O
We also provide pre-built dashboards on Grafana
We are using Grafana 4.0, released a few months ago, which adds alerting: a truly revolutionary feature, as it transforms Grafana from a visualisation tool into a mission-critical monitoring tool
We’ll have a demo, but before going into it ..
Before doing a live presentation, I’d like to answer a question:
How do I ship such a complex infrastructure, one that can easily grow into hundreds of running services?
We preferably use:
Deployment apps such as Ansible
Docker-based technologies for stateless micro-services
CDH-based integration with Cloudera Manager for CDH Hadoop clusters
https://docs.landoop.com/
CDH docs - https://docs.landoop.com/
More connectors are added monthly
Time-travel in Kafka topics with KCQL queries, in real time