Offices in Madrid: C/ Francisco Silvela, 54 Duplicado 1ºD 28028
Tel: 91 080 82 44
Offices in Barcelona: C/ Madrazo 27-29 4ª 08006
Tel: 933 68 52 46

KAFKA SCHEMA REGISTRY
- The Schema Registry is a distributed storage layer for Avro schemas that uses Kafka as its underlying storage mechanism.
- It assigns a globally unique ID to each registered schema. Allocated IDs are guaranteed to be monotonically increasing, but not necessarily consecutive.
- Kafka provides the durable backend and functions as a write-ahead changelog for the state of the Schema Registry and the schemas it contains.
- The Schema Registry is designed to be distributed, with a single-master architecture; ZooKeeper/Kafka coordinates master election.
- Memory and CPU consumption is very low, even for larger Kafka clusters or many event schemas.
- The Schema Registry keeps no disk data of its own; all schemas reside in a Kafka log store plus in-memory indices.
- Avoid running Schema Registry clusters across different data centers: the increase in network latency and/or network availability problems may hinder the performance of the whole Kafka cluster.
- Back up your schemas topic regularly (e.g. with a Kafka sink connector).
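The registration flow above can be sketched against the Schema Registry's REST API. The schema, subject name, and URL below are illustrative assumptions; the registry expects the Avro schema serialized as a string inside the `schema` field of the request body and answers with the globally unique id.

```python
import json

# Hypothetical Avro schema, for illustration only.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
}

def registration_request(base_url: str, subject: str, schema: dict):
    """Build the URL and JSON body for registering a new schema version
    under a subject (POST /subjects/{subject}/versions)."""
    url = f"{base_url}/subjects/{subject}/versions"
    # The Avro schema itself travels as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)})
    return url, body

url, body = registration_request("http://localhost:8081", "users-value", USER_SCHEMA)
print(url)  # http://localhost:8081/subjects/users-value/versions
# POST `body` with Content-Type: application/vnd.kafka... er,
# application/vnd.schemaregistry.v1+json; the response carries the
# allocated id, e.g. {"id": 1}.
```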
KAFKA CONNECT
- Kafka Connect is a framework for connecting Kafka with external systems and data sources.
- Source connectors ingest data from various systems, e.g. databases, filesystems, logs, etc.
- Sink connectors, in turn, deliver data from Kafka topics to other systems (Elasticsearch, Hadoop, etc.).
- It can run standalone or clustered.
- Connectors provide an abstraction layer when pulling data from or pushing data to Kafka.
- They are extremely flexible and scalable (clustering, batching, streaming, etc.).
- Connectors can be extended or reused with ease.
KAFKA CONNECT: Connectors
- Connectors define the source and target of the data.
- A connector instance manages the copying of data between Kafka and another system.
- A connector plugin is the packaging of the classes used by such an instance, which can then be deployed as instances and replicated at will.
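Creating a connector instance boils down to POSTing a name plus a config map to the Connect REST API. A minimal sketch, using the FileStreamSource example connector that ships with Kafka Connect; the connector name, file path, topic, and port are assumptions.

```python
import json

def create_connector_request(connect_url: str, name: str, config: dict):
    """Build the request for creating a connector instance via the
    Kafka Connect REST API (POST /connectors)."""
    body = json.dumps({"name": name, "config": config})
    return f"{connect_url}/connectors", body

url, body = create_connector_request(
    "http://localhost:8083",
    "demo-file-source",
    {
        # FileStreamSource is bundled with Kafka Connect as an example.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "demo-topic",
    },
)
# POST `body` with Content-Type: application/json to deploy the instance.
```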
KAFKA CONNECT: Tasks
- Each connector instance coordinates a set of tasks that copy the data.
- Tasks store no state themselves, which makes parallelism and scaling straightforward.
- Task state is stored in special Kafka topics (for both configuration and status) and is managed by the connector instance.
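The state of a connector and its tasks can be inspected through the Connect REST API. A small sketch of parsing a status response; the connector name, worker addresses, and port are illustrative.

```python
def status_url(connect_url: str, name: str) -> str:
    """URL for a connector's status, including each task's state
    (GET /connectors/{name}/status)."""
    return f"{connect_url}/connectors/{name}/status"

def failed_task_ids(status: dict) -> list:
    """Extract the ids of tasks reported as FAILED from a status response."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]

# Example response shape for GET /connectors/{name}/status:
example = {
    "name": "demo-file-source",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "10.0.0.2:8083"},
    ],
}
print(failed_task_ids(example))  # [1]
```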
KAFKA CONNECT: Task rebalancing
- When a connector is first submitted, the workers rebalance the full set of connectors in the cluster and their tasks so that each worker has approximately the same amount of work.
- When a worker fails, its tasks are rebalanced across the healthy workers.
- When a task fails, no rebalancing takes place; the task should be restarted via the Connect REST API.
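The restart mentioned above is an explicit call against the Connect REST API. A minimal sketch; connector name, task id, and port are assumptions.

```python
def task_restart_url(connect_url: str, name: str, task_id: int) -> str:
    """URL for restarting a single failed task via the Connect REST API
    (POST /connectors/{name}/tasks/{id}/restart)."""
    return f"{connect_url}/connectors/{name}/tasks/{task_id}/restart"

url = task_restart_url("http://localhost:8083", "demo-file-source", 1)
print(url)  # http://localhost:8083/connectors/demo-file-source/tasks/1/restart
# An empty POST to this URL asks the worker to restart the failed task.
```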
KAFKA CONNECT: Workers
- Workers are responsible for running connectors and their associated tasks.
- There are two main types of workers: standalone and distributed.
- Standalone workers are the simplest: a single process is responsible for executing all connectors and tasks.
- Distributed workers provide scalability and fault tolerance: they start many processes and coordinate the execution of connectors and tasks across all of them.
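A distributed worker is configured through a properties file; the internal topics mentioned earlier (config, offsets, status) are named there. A minimal sketch of a connect-distributed.properties, where the group id and topic names are illustrative:

```properties
# Distributed-worker configuration sketch; values are examples.
bootstrap.servers=localhost:9092
# Workers with the same group.id form one Connect cluster.
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Internal Kafka topics holding connector/task state.
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
```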
KAFKA CONNECT: Converters
- Converters handle support for different data formats when writing to or reading from a Kafka cluster.
- Tasks use converters to change the format of the data.
- By default, Kafka Connect provides AvroConverter, JsonConverter, StringConverter, and ByteArrayConverter.
- They are completely decoupled from connectors, allowing for reusability.
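Because converters are decoupled from connectors, they can also be overridden per connector in its config map. A sketch assuming Confluent's AvroConverter and a Schema Registry on a local port:

```json
{
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://localhost:8081"
}
```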
KAFKA CONNECT: Transforms
- Connectors can be configured with transformations that make simple, lightweight modifications to individual messages.
- Really convenient for minor data adjustments.
- They are chainable, and you can write your own by implementing the transformation interface.
- Kafka Connect already bundles some of them:
  Cast: Cast fields or the entire key or value to a specific type, e.g. to force an integer field to a smaller width.
  ExtractField: Extract the specified field from a Struct when a schema is present, or from a Map in the case of schemaless data.
  Flatten: Flatten a nested data structure.
  HoistField: Wrap data using the specified field name in a Struct or Map.
  InsertField: Insert a field using attributes from the record metadata or a configured static value.
  MaskField: Mask specified fields with a valid null value for the field type.
  RegexRouter: Update the record topic using the configured regular expression and replacement string.
  ReplaceField: Filter or rename fields.
  SetSchemaMetadata: Set the schema name, version, or both on the record's key or value schema.
  TimestampConverter: Convert timestamps between different formats.
  TimestampRouter: Update the record's topic field as a function of the original topic value and the record timestamp.
  ValueToKey: Replace the record key with a new key formed from a subset of fields in the record value.
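Transforms are wired up in the connector config with a `transforms` alias list plus per-alias properties. A sketch using the bundled RegexRouter; the alias, regex, and replacement are illustrative:

```json
{
  "transforms": "route",
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "legacy-(.*)",
  "transforms.route.replacement": "migrated-$1"
}
```

Chaining works by listing several aliases, e.g. `"transforms": "route,mask"`, each with its own `transforms.<alias>.*` properties.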
KAFKA KSQL
- KSQL is the open-source streaming SQL engine for Apache Kafka.
- It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python.
- Scalable, elastic, fault-tolerant, and real-time.
- It supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
- Typical applications include streaming ETL, real-time monitoring and analytics, data exploration and discovery, anomaly detection, personalization, IoT, and customer 360.
KAFKA KSQL components
KSQL Server
- The KSQL server runs the engine that executes KSQL queries. This includes
processing, reading, and writing data to and from the target Kafka cluster.
- KSQL servers form KSQL clusters and can run in containers, virtual machines, and
bare-metal machines. You can add and remove servers to/from the same KSQL
cluster during live operations to elastically scale KSQL’s processing capacity as
desired. You can deploy different KSQL clusters to achieve workload isolation.
KSQL CLI
- You can interactively write KSQL queries by using the KSQL command line
interface (CLI). The KSQL CLI acts as a client to the KSQL server. For production
scenarios you may also configure KSQL servers to run in non-interactive
“headless” configuration, thereby preventing KSQL CLI access.
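The headless configuration mentioned above is enabled in the server properties by pointing the server at a fixed file of statements; the file path below is illustrative:

```properties
# Headless deployment sketch (ksql-server.properties): the server runs
# only the statements in this file and interactive CLI access is disabled.
bootstrap.servers=localhost:9092
ksql.queries.file=/etc/ksql/queries.sql
```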
KAFKA REST Proxy
Provides a RESTful interface to a Kafka cluster, making it easy to produce and consume
messages, view the state of the cluster, and perform administrative actions without
using the native Kafka protocol or clients. Some of the key features are:
- Metadata: Most metadata about the cluster – brokers, topics, partitions, and
configs – can be read using GET requests for the corresponding URLs.
- Producers: Instead of exposing producer objects, the API accepts produce
requests targeted at specific topics or partitions and routes them all through a
small pool of producers. Producer instances are shared, so configs cannot be set
on a per-request basis. However, you can adjust settings globally by passing new
producer settings in the REST Proxy configuration.
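A produce request against the v2 API targets a topic URL and wraps each message in a `records` envelope. A sketch; the topic, payload, and port are assumptions.

```python
import json

def produce_request(proxy_url: str, topic: str, values: list):
    """Build a produce request for the REST Proxy v2 API
    (POST /topics/{topic})."""
    body = json.dumps({"records": [{"value": v} for v in values]})
    return f"{proxy_url}/topics/{topic}", body

url, body = produce_request("http://localhost:8082", "orders",
                            [{"id": 1, "amount": 9.99}])
# POST `body` with Content-Type: application/vnd.kafka.json.v2+json;
# the response reports the partition and offset assigned to each record.
```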
- Consumers: The REST Proxy uses either the high-level consumer (v1 API) or the new 0.9 consumer (v2 API) to implement consumer groups that can read from topics. Consumers are stateful and therefore tied to specific REST Proxy instances. Offset commit can be either automatic or explicitly requested by the user. Currently limited to one thread per consumer; use multiple consumers for higher throughput.
- Data Formats: The REST Proxy can read and write data using JSON, raw bytes encoded with base64, or JSON-encoded Avro. With Avro, schemas are registered and validated against the Schema Registry.
- REST Proxy Clusters and Load Balancing: The REST Proxy is designed to support multiple instances running together to spread load, and it can safely be run behind various load-balancing mechanisms (e.g. round-robin DNS, discovery services, load balancers).