Delivering Fast Data Systems with Kafka
LANDOOP
www.landoop.com
Antonios Chalkiopoulos
18/1/2017
@chalkiopoulos
Open Source contributor
Big Data projects in Media, Betting, Retail and Investment Banks in London
Books
Author, Programming MapReduce with Scalding


Founder of Landoop
DevOps Big Data Scala
Automation Distributed Systems Monitoring
Hadoop Fast Data / Streams Kafka
KAFKA CONNECT
a bit of context
KAFKA CONNECT
“a common framework
for allowing stream data flow
between kafka and other systems”
Data is produced from a source and consumed by a sink.
Data Source → Kafka Connect → KAFKA → Kafka Connect → Data Sink
                     Stream processing
E (Extract)  ·  T (Transform)  ·  L (Load)
Developers don’t care about:

Move data to/from sink/source
Support delivery semantics
Offset Management
Serialization / de-serialization
Partitioning / Scalability
Fault tolerance / fail-over
Schema Registry integration
Developers care about:

Domain specific transformations
CONNECTORS
Kafka Connect’s framework allows developers to create connectors that
copy data to/from other systems just by writing configuration files and
submitting them to Connect with no code necessary
Connector configurations are key-value mappings
name connector’s unique name
connector.class connector’s java class
tasks.max maximum tasks to create
topics list of topics (to source or sink data)
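As a minimal sketch, a configuration using only these common settings could be a properties file like the following (the FileStream sink ships with Apache Kafka; the name, topic and file path are placeholders, and `file` is a connector-specific setting):

```properties
name=demo-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=demo-topic
file=/tmp/demo-sink.txt
```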
Introducing a query language for the connectors
name connector’s unique name
connector.class connector’s java class
tasks.max maximum tasks to create
topics list of topics (to source or sink data)
query KCQL query specifies fields/actions for the target system
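Submitted through the Kafka Connect REST API, such a configuration might look like the JSON below. The connector name, class and topic are illustrative, and the exact name of the KCQL property is connector-specific (shown here simply as `query`, as on the slide):

```json
{
  "name": "fx-redis-sink",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.redis.sink.RedisSinkConnector",
    "tasks.max": "1",
    "topics": "yahooFX-topic",
    "query": "INSERT INTO FXSortedSet SELECT symbol, price FROM yahooFX-topic STOREAS SortedSet(score=ts)"
  }
}
```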
KCQL
Kafka Connect Query Language
is a SQL-like syntax that streamlines the configuration of Kafka Sink Connectors, and then some more…
Example:
Project fields, rename or ignore them and further customise in plain text
INSERT INTO transactions SELECT field1 AS column1, field2 AS column2, field3 FROM TransactionTopic;
INSERT INTO audits SELECT * FROM AuditsTopic;
INSERT INTO logs SELECT * FROM LogsTopic AUTOEVOLVE;
INSERT INTO invoices SELECT * FROM InvoiceTopic PK invoiceID;
So while integrating
Kafka with in-memory
data grid, key-value,
document stores,
NoSQL, search etc
systems..
INSERT INTO $TARGET
SELECT *|columns (i.e. col1,col2 | col1 AS column1,col2)
FROM $TOPIC_NAME
[ IGNORE columns ]
[ AUTOCREATE ]
[ PK columns ]
[ AUTOEVOLVE ]
[ BATCH = N ]
[ CAPITALIZE ]
[ INITIALIZE ]
[ PARTITIONBY cola[,colb] ]
[ DISTRIBUTEBY cola[,colb] ]
[ CLUSTERBY cola[,colb] ]
[ TIMESTAMP cola|sys_current ]
[ STOREAS $YOUR_TYPE([key=value, .....]) ]
[ WITHFORMAT TEXT|AVRO|JSON|BINARY|OBJECT|MAP ]
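A single query can combine several of these optional clauses. As a hedged illustration (the topic, table and field names below are hypothetical, and each connector honours only the clauses that make sense for its target system):

```sql
INSERT INTO orders_table
SELECT orderId AS id, amount, currency
FROM orders-topic
AUTOCREATE
PK id
PARTITIONBY currency
WITHFORMAT AVRO
```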
KCQL
What does it look like?
Topic to target mapping
Field selection
Auto creation
Auto evolution
Error policies
Multiple KCQLs / topic 

- Field extraction

- Access to Key & Metadata
Why KCQL ?
KCQL
Advanced Features Examples
KCQL |
{ "sensor_id": "01" , "temperature": 52.7943, "ts": 1484648810 }
{ "sensor_id": "02" , "temperature": 28.8597, "ts": 1484648810 }
Example Kafka topic with IoT data
INSERT INTO sensor_ringbuffer 

SELECT sensor_id, temperature, ts 

FROM coap_sensor_topic 

WITHFORMAT JSON

STOREAS RING_BUFFER
INSERT INTO sensor_reliabletopic 

SELECT sensor_id, temperature, ts

FROM coap_sensor_topic 

WITHFORMAT AVRO

STOREAS RELIABLE_TOPIC
INSERT INTO FXSortedSet 

SELECT symbol, price 

FROM yahooFX-topic 

STOREAS SortedSet(score=ts)
SELECT price 

FROM yahooFX-topic 

PK symbol 

STOREAS SortedSet(score=ts)
KCQL |
{ "symbol": "USDGBP" , "price": 0.7943, "ts": 1484648810 }
{ "symbol": "EURGBP" , "price": 0.8597, "ts": 1484648810 }
Example Kafka topic with FX data
B:1 A:2 D:3 C:20
Sorted Set -> { value : score }
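The sorted-set semantics can be sketched in a few lines of plain Python. This only illustrates the { value : score } ordering shown above; it is not the Redis client API:

```python
# Minimal sketch of Redis Sorted Set semantics:
# unique members, each with a score, ordered by score.
def zadd(zset, member, score):
    zset[member] = score  # re-adding a member just updates its score

def zrange(zset):
    # members returned in ascending score order
    return sorted(zset, key=zset.get)

fx = {}
zadd(fx, "B", 1)
zadd(fx, "A", 2)
zadd(fx, "D", 3)
zadd(fx, "C", 20)
print(zrange(fx))  # ['B', 'A', 'D', 'C']
```

Because members are unique, writing every FX update for a symbol into one set keyed by timestamp is what makes the time-range queries mentioned on the slide possible.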
Stream reactor connectors support KCQL
kafka-connect-blockchain
kafka-connect-bloomberg
kafka-connect-cassandra
kafka-connect-coap
kafka-connect-druid
kafka-connect-elastic
kafka-connect-ftp
kafka-connect-hazelcast
kafka-connect-hbase
kafka-connect-influxdb
kafka-connect-jms
kafka-connect-kudu
kafka-connect-mongodb
kafka-connect-mqtt
kafka-connect-redis
kafka-connect-rethink
kafka-connect-voltdb
kafka-connect-yahoo
Source: https://github.com/datamountaineer/stream-reactor
Integration Tests: http://coyote.landoop.com/connect/
DEMO
Kafka Connect InfluxDB
We'll need:
• Zookeeper
• Kafka Broker
• Schema Registry
• Kafka Connect Distributed
• Kafka REST Proxy
We'll also use:
• StreamReactor connectors
• Landoop Fast Data Web Tools
docker run --rm -it \
  -p 2181:2181 -p 3030:3030 -p 8081:8081 \
  -p 8082:8082 -p 8083:8083 -p 9092:9092 \
  -e ADV_HOST=192.168.99.100 \
  landoop/fast-data-dev
case class DeviceMeasurements(
  deviceId: Int,
  temperature: Int,
  moreData: String,
  timestamp: Long)
We’ll generate some Avro messages
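The corresponding Avro schema for that case class would look roughly like this (the record name matches the case class; the namespace and default-free layout are assumptions):

```json
{
  "type": "record",
  "name": "DeviceMeasurements",
  "fields": [
    { "name": "deviceId",    "type": "int" },
    { "name": "temperature", "type": "int" },
    { "name": "moreData",    "type": "string" },
    { "name": "timestamp",   "type": "long" }
  ]
}
```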
DEMO
Kafka Development Environment
@ Fast-data-dev docker image
https://hub.docker.com/r/landoop/fast-data-dev/
DEMO
Integration testing with Coyote
for connectors & infrastructure
https://github.com/Landoop/coyote
Schema Registry UI
https://github.com/Landoop/schema-registry-ui
Kafka Topics UI
https://github.com/Landoop/kafka-topics-ui
Kafka Connect UI
https://github.com/Landoop/kafka-connect-ui
Connectors Performance
Monitoring & Alerting
via JMX
Deployment apps
Containers: Mesos, Kubernetes
Hadoop integration
* state-less apps = container-friendly: schema registry, kafka connect
How do I IT?
Available features: 

Kafka ecosystem
StreamReactor Connectors
Landoop web tools
Monitoring & Alerting
Security features
Wrap up
- KCQL
- Connectors
- Kafka Web Tools
- Automation & Integrations
Coming up
- Kafka backend
enhanced UIs | Time travel
$ locate
https://github.com/Landoop
https://hub.docker.com/r/landoop/
https://github.com/datamountaineer/stream-reactor
http://www.landoop.com
Thank you ;)

London Apache Kafka Meetup (Jan 2017)


Editor's Notes

  • #2 Thank you very much for coming today. I will be delivering a talk about building Fast Data systems with Kafka
  • #3 My name is Antonios. I’ve been involved with open-source projects on distributed systems in the Hadoop ecosystem, and currently I have Apache Kafka in my heart :) I have authored a book on MapReduce using Scalding and co-authored another one
  • #4 Landoop is a startup focusing on DevOps, Distributed Systems and particularly Apache Kafka
  • #5 Today I’d like to start the presentation with Kafka Connect. I guess most of you are already familiar with it, so I will give a quick overview
  • #6 Kafka Connect was introduced almost one year ago, as a feature of Apache Kafka 0.9+ with the narrow (although very important) scope of copying streaming data from and to a Kafka cluster. I found the concept really interesting and decided to experiment with it to see what this framework introduces. Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale. The announcement by confluent https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
  • #7 And this is how Kafka Connect fits into the picture on a Kafka based system. You would normally use a stream processing framework to transform your data streams i.e. Spark, K-Streams, etc
  • #8 And what Kafka Connect offers is the separation of concerns. It can simplify the key stages of the ETL process, and using simple tools we can build and maintain distributed streaming data pipelines. The E (the extraction) and the L (the load) can be taken care of for you, so as a developer you can focus on the T (the transformations). By combining Kafka Connect and stream processing engines we can perform streaming ETL. Each does the job it is best at, and Kafka acts as the underlying data storage layer that supports them and allows simple integration with a variety of other applications.
  • #9 By using a robust framework whose deployments deliver scalability, fault tolerance and automated load balancing out-of-the-box, we can focus on extracting value in a transformation layer. As you can see here, the basic pattern is to use Kafka Connect to perform Extraction of the data and load it into Kafka as a temporary, scalable, fault-tolerant streaming data store. While you can do this with other, more generic data copy tools, you’ll commonly lose important semantics such as at-least-once delivery of data. Once the data is extracted, you use stream processing engines to perform Transformation; either this is the endpoint for the data, or you can deliver it back into Kafka. Finally, Load the data with another Kafka connector into the destination system. Obviously this is a simplified picture and your pipeline will grow much more complex, with multiple stages of transformation (especially if the intermediate data is useful for reuse by multiple applications, including anything downstream that may not be processed by stream processing engines).
  • #11 Most configurations are connector dependent, but there are a few settings common to all connectors
  • #12 What we are introducing to all our Kafka connectors is the KCQL query
  • #16 Let’s look at some of the more advanced features of KCQL - and in particular regarding some sinks.
  • #17 Hazelcast, for example, supports the Ring Buffer data structure, which is quite popular from the Disruptor pattern. Data can be pushed into a fixed-size buffer with a particular retention period. If the buffer gets filled, an eviction policy is triggered to either evict the oldest records or deny the addition of new records. So to write some IoT data from a Kafka topic into a Ring Buffer, we can use the STOREAS keyword. On the right side, you can see how we can store the same data into a RELIABLE TOPIC, another Hazelcast data structure. *Hazelcast requires data to be serializable; both JSON and Avro are supported.
  • #18 Redis provides the Sorted Set data structure. This structure allows only unique elements to be added, and each element is required to be scored to enforce ordering. This data structure is often used to preserve time-series data, as Redis allows running time-range queries. So if we have a Kafka topic with Foreign Exchange data, we can either store all the messages into a single SortedSet (the one in blue) OR create a new SortedSet for each symbol (one SortedSet per currency rate) using the PK syntax on the right
  • #19 So this is a list of Apache 2.0 licensed Kafka Connectors that we have been working on. Blockchain, Bloomberg, the Cassandra connector that is certified by DataStax, a Constrained Application Protocol connector, Elastic Search, JMS, MQTT and others are some of the connectors already available, and released against the 2 latest releases of Apache Kafka.
  • #20 https://github.com/Landoop/fast-data-dev
  • #21 So let’s see a DEMO in real-time http://fast-data-dev.demo.landoop.com
  • #22 So let’s see a DEMO in real-time https://coyote.landoop.com/connect/
  • #23 So let’s see a DEMO in real-time http://schema-registry-ui.landoop.com
  • #24 So let’s see a DEMO in real-time http://kafka-topics-ui.landoop.com
  • #25 So let’s see a DEMO in real-time http://kafka-connect-ui.landoop.com
  • #26 Connectors look simple overall, and I know a number of people in this room are already using them in production. So what does performance look like? The image above demonstrates that, depending on the sink system, we can sink 50K records/sec by using: 20 partitions, 3 connect tasks, 5 GB RAM per connector, less than 2 CPUs. On the bottom-left corner we can see that we have saturated 50% of the available network bandwidth. Depending on the number of tasks and partitions, we can easily increase sink performance to more than 100K records/sec. The lesson regarding performance is that Kafka Connect can scale really well, but it requires quite some memory and quite some CPU, especially if batching writes
  • #27 We have also sent Pull Requests to the Prometheus team to enable GZIP compression and minimise any impact on the running system, something that has significantly decreased the network I/O. We then provide pre-built dashboards on Grafana. We are using Grafana 4.0, released a few months ago, which supports alerting; that is a really revolutionary feature, as it transforms Grafana from a visualisation tool into a truly mission-critical monitoring tool. We’ll have a demo, but before going into it ..
  • #28 Before doing a live presentation, I’d like to answer a question: how do I ship such a complex infrastructure, one that can easily grow into hundreds of running services? We preferably use: deployment apps such as Ansible, Docker-based technologies for state-less micro-services, and Cloudera Manager-based integration for CDH Hadoop clusters
  • #29 https://docs.landoop.com/
  • #30 CDH docs - https://docs.landoop.com/
  • #31 More connectors are added monthly
  • #32 Time-Travel in Kafka topics and KCQL queries and real-time
  • #33 http://www.landoop.com https://github.com/Landoop https://github.com/datamountaineer/stream-reactor https://hub.docker.com/r/landoop/