Kafka Connect is used to build data pipelines by integrating Kafka with other data systems through plugins called connectors, converters, and transformations. Single message transformations (SMTs) modify individual messages as they move between Kafka and a system such as Elasticsearch, while Kafka Streams is a better fit for more complex transformations that involve multiple messages. When using Kafka Connect to sink data to Elasticsearch, recommended practices include managing indices by day, removing unnecessary fields, and overwriting the _id field to control de-duplication. Custom transformations can be implemented when the built-in ones are not enough, and because transformations are chained, their ordering matters.
5. Data Format Standard is important!
syntax = "proto2";

message MessageEnvelope {
  // required fields
  required string data_type = 1;
  required string created_at_us = 2;
  required string source_name = 3;
  // optional
  optional string schema = 4;
  // payload
  optional bytes payload = 5;
}
6. Why?
• To query Kafka messages in real time
• To quickly find the location of a message
• To trace a historic event for debugging and diagnosis
• To monitor data quality in the pipeline
• To monitor and project data volume in the pipeline for capacity planning
• To detect abnormal data patterns
7. Takeaways
• A quick overview of Kafka Connect
• How data transformation works in Kafka Connect
• What an SMT is
• Some use cases for SMTs
• SMT vs Kafka Streams for data transformation
• Tips for using Kafka Connect to sink data to Elasticsearch
9. More reasons to use Kafka Connect
• Lightweight and stateless
• Scalable and fault-tolerant
• Integrates with Kafka and many other data systems
• Pluggable architecture makes customization easy and configurable
• Lots of open-source plugins (connectors and converters) available
• Runs in two modes:
  • standalone mode is great for development and local testing
  • distributed mode is great for scaling and fault tolerance
• A REST API is available to monitor and configure your connectors in distributed mode
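As a rough illustration (not part of the original deck), a minimal standalone-mode worker configuration might look like the sketch below; the broker address and file paths are placeholders.
# connect-standalone.properties (sketch)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=share/java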
11. Plugin: Data Converter
● Default AVRO or JSON, or write your own
● Configurable
  ○ Different data converters for key and value
  ○ Specify how null, invalid, or malformed messages should be handled
● Kafka Connect isolates each plugin from the others, so that libraries in one plugin are not affected by the libraries in any other plugin
  ○ `plugin.path` is configured in the Kafka Connect worker configuration
  ○ Build your JAR with dependencies and copy it to `plugin.path`
# Use an absolute path if running Connect from a directory other than the home directory of Confluent Platform.
plugin.path=share/java
12. Plugin: Data Converter
# Data converter plugin
value.converter.protoClassName=net.demonware.pipes.connect.data.proto.MessageEnvelopeOuterClass$MessageEnvelope
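As a rough illustration (not from the original deck), a connector can use different converters for the key and the value, and newer Connect versions can also be told to tolerate bad records instead of failing; the converter choices and URL below are placeholders.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
errors.tolerance=all
errors.log.enable=true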
14. What is SMT?
• Modifies messages going out of Kafka before they reach Elasticsearch
• One message at a time
• Many built-in SMTs are already available
• Flexible within the constraints of the TransformableRecord API and a 1:{0,1} mapping
• Transformations are chained
• Pluggable transformers through the Connect configuration
15. Default Kafka Connect SMT
InsertField: Insert a field using attributes from the record metadata or a configured static value.
MaskField: Mask specified fields with a valid null value for the field type.
ReplaceField: Filter or rename fields.
TimestampConverter: Convert timestamps between different formats such as Unix epoch, strings, and Connect Date and Timestamp types.
TimestampRouter: Update the record's topic field as a function of the original topic value and the record timestamp.
RegexRouter: Update the record topic using the configured regular expression and replacement string.
16. Default Kafka Connect SMT (continued)
Cast: Cast fields or the entire key or value to a specific type, e.g. to force an integer field to a smaller width.
ExtractField: Extract the specified field from a Struct when a schema is present, or a Map in the case of schemaless data. Any null values are passed through unmodified.
ExtractTopic: Replace the record topic with a new topic derived from its key or value.
Flatten: Flatten a nested data structure. This generates names for each field by concatenating the field names at each level with a configurable delimiter character.
HoistField: Wrap data using the specified field name in a Struct when a schema is present, or a Map in the case of schemaless data.
ValueToKey: Replace the record key with a new key formed from a subset of fields in the record value.
17. Configuring SMT
• An alias in `transforms` implies that additional keys for that transformation are configurable.
• Syntax:
  • transforms.$alias.type – fully qualified class name of the transformation
  • transforms.$alias.* – all other keys, as defined in Transformation.config(), are embedded with this prefix
• Example:
transforms.insertKafkaMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertKafkaMetadata.topic.field=kafka_topic
transforms.removeFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.removeFields.blacklist=context,tracing,payload
transforms.convertTimestampUnit.type=net.demonware.pipes.kafka.connect.transforms.ConvertTimeToMillis$Value
transforms.convertTimestampUnit.timestamp.fields=created_at_us,ingested_at_us
18. Ordering of SMT matters!
• SMTs are chained
• SMTs are applied in the order they are specified in `transforms`
• If your transformations are order dependent, make sure they are specified in the correct order
• Example:
transforms=insertKafkaMetadata,indexMapping
transforms.indexMapping.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.indexMapping.topic.format=topic-changed-${timestamp}
transforms.indexMapping.timestamp.format=yyyy.MM.dd
transforms.insertKafkaMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertKafkaMetadata.topic.field=kafka_topic
19. Create Custom SMT
• Only if you cannot use the built-in SMTs and cannot use Kafka Streams for the data transformation
• Must implement the Transformation interface (shown on the next slide, with a rough sketch after it)
• Consider making your SMT configurable
• If you have multiple custom SMTs, it is better to give each one its own Transformation implementation
20. Interface: Transformation
// Existing base class for SourceRecord and SinkRecord, new self type parameter.
public abstract class ConnectRecord<R extends ConnectRecord<R>> {
    // ...
    // New abstract method:
    /** Generate a new record of the same type as itself, with the specified parameter values. **/
    public abstract R newRecord(String topic, Schema keySchema, Object key, Schema valueSchema, Object value, Long timestamp);
}

public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {
    // via Configurable base interface:
    // void configure(Map<String, ?> configs);

    /**
     * Apply transformation to the {@code record} and return another record object (which may be {@code record} itself)
     * or {@code null}, corresponding to a map or filter operation respectively. The implementation must be thread-safe.
     */
    R apply(R record);

    /** Configuration specification for this transformation. **/
    ConfigDef config();

    /** Signal that this transformation instance will no longer be used. **/
    @Override
    void close();
}
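As a rough, hypothetical sketch (not from the original deck): a minimal configurable SMT that drops one field from a schemaless Map value. The package, class, and config names are made up, and it targets the released Connect API, in which newRecord also takes the record's partition.
package net.example.transforms;  // hypothetical package

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class DropField<R extends ConnectRecord<R>> implements Transformation<R> {

    private static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define("field", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                    "Name of the field to remove from the record value.");

    private String field;

    @Override
    public void configure(Map<String, ?> configs) {
        field = (String) configs.get("field");
    }

    @Override
    @SuppressWarnings("unchecked")
    public R apply(R record) {
        // Only handle schemaless Map values; pass everything else through unchanged.
        if (!(record.value() instanceof Map)) {
            return record;
        }
        Map<String, Object> value = new HashMap<>((Map<String, Object>) record.value());
        value.remove(field);
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                null, value, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}
It would then be configured like any other SMT, e.g. transforms.dropContext.type=net.example.transforms.DropField and transforms.dropContext.field=context (alias and field name hypothetical).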
25. SMT vs Kafka Streams

Kafka Streams
● Recommended practice in general
● The transformation involves multiple messages, such as aggregation
● More complex transformations: aggregation, windowing, joining
● The transformed data will be consumed by multiple downstream consumers; reduce overhead by running the transformation only once and allowing reuse

SMT
● Lightweight and simple data transformation
● Covered by the Kafka Connect built-in SMTs
● Data footprint cost is a concern; a large amount of transformed data written back to Kafka is too costly
● Simplicity in the streaming data pipeline is important; you want to keep pipeline stages and services to a minimum
● The transformation does not interact with external systems
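To make the distinction concrete, here is a rough, hypothetical Kafka Streams sketch (not from the original deck) of a multi-message transformation that an SMT cannot express: counting records per key. The topic names and broker address are placeholders.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Produced;

public class EventCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("events")      // hypothetical input topic
               .groupByKey()                          // aggregation spans many messages per key
               .count()
               .toStream()
               .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}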
27. Do's
• Overwrite the ES @timestamp internal field
• Overwrite the document '_id' field (e.g. "123e4567-e89b-12d3-a456-426655440000") to control how your data is de-duplicated
• Remove unnecessary columns/fields to save space and reduce the footprint of your ES cluster
• Manage your ES index by day: you can use the 'TimestampRouter' and 'RegexRouter' SMTs to generate one ES index per day for your data (see the sketch after this list)
• If you want binary data to be searchable in a user-friendly format, transform the binary data prior to indexing
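A rough illustration of daily index routing (not from the original deck): the built-in TimestampRouter rewrites the topic name, which the Elasticsearch sink then uses as the index name. The alias below is hypothetical.
transforms=dailyIndex
transforms.dailyIndex.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.dailyIndex.topic.format=${topic}-${timestamp}
transforms.dailyIndex.timestamp.format=yyyy.MM.dd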
28. Don'ts
● Some cosmetic data format tweaking can be done in Kibana instead:
  ○ Date display format
  ○ Base64-decoding binary data for display
  ○ Type casting from integer to text
● If you need to modify the Kafka Connect source code for any reason, you might want to reconsider using Kafka Connect:
  ○ it can be hard to debug and test; maybe you should consider Kafka Streams instead
● When implementing your own transformations, keep each transformation implementation separate rather than having a single transformation class that does a bunch of things