1. | © Copyright 2022, InfluxData
Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Comprehensive Guide
Zoe Steinkamp
2. Zoe Steinkamp
Developer Advocate - InfluxData
LinkedIn
3. Agenda
● Intro to time series databases
● Introduction to InfluxDB V3
● Introduction to Kafka + Telegraf
● Company Examples
● How Confluent supports InfluxDB
● Resources
4. Why a time series database is important
5. Time series data
● Where - from sources: networks, infrastructure & applications
● When - based on time: hours, minutes, seconds & nanoseconds
6. Types of time series data
● Metrics - quantitative values collected regularly over time
● Events - state changes or values generated irregularly over time
● Traces - complete event or request propagation in a distributed system
7. Rise of time series as a category
Time series is the fastest growing data category by far (source: DB-Engines).
● TIME SERIES - events, metrics, time-stamped data; for IoT, analytics, cloud native
● RELATIONAL - orders, customers, records
● DOCUMENT - high throughput, large documents
● SEARCH - distributed search, logs, geo
8. Use Cases - Time Series
Time series databases address a few major issues:
● Ingest - a high volume of data streaming in at nanosecond precision
● Compression - the ability to store this large data set without breaking the bank
● Cardinality - the need to store wide rows of timestamped data with multiple values
● Querying on time - querying on time instead of indexes or values
Example use cases: warehouse IoT devices; tracking and monitoring website infrastructure; metrics from everywhere; robotics and green energy.
9. Data Storage in Time-Series DBs
Line protocol syntax:
<measurement>[,<tag-key>=<tag-value>] [<field-key>=<field-value>] [unix-nano-timestamp]
● Measurement: server
● Tag set: hostname=server02,us_west=az
● Field set: cpu=24.5,mem=12.4
● Timestamp: 1234567890000000
Reference: https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/
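As a quick illustration of the syntax above, the example point can be assembled with a small helper (a sketch only; it skips the escaping rules that full line protocol requires):

```python
def to_line_protocol(measurement, tags, fields, timestamp):
    """Assemble an InfluxDB line protocol string: measurement,tags fields timestamp."""
    tag_set = ",".join(f"{k}={v}" for k, v in tags.items())
    field_set = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_set} {field_set} {timestamp}"

line = to_line_protocol(
    "server",
    {"hostname": "server02", "us_west": "az"},
    {"cpu": 24.5, "mem": 12.4},
    1234567890000000,
)
print(line)  # server,hostname=server02,us_west=az cpu=24.5,mem=12.4 1234567890000000
```

Note the shape: tags and fields are comma-separated key=value pairs, and the three sections are separated by single spaces.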
11. Typical Architecture & Deployment
12. InfluxDB 3.0
● Schema on write
● Write and query millions of rows per second
● Single datastore for all time series data (metrics, logs & traces)
● SQL support
13. Apache Arrow is…
“Apache Arrow is a framework for defining in-memory columnar data that every processing engine can use.”
● Language-agnostic standard for columnar memory
● Efficient for running large analytical workloads on modern CPU and GPU architectures
15. Apache Parquet…
“Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.”
What is the difference between Apache Arrow and Apache Parquet?
● Parquet is not a runtime in-memory format
● Parquet data cannot be directly operated on, but must be decoded in large chunks
The benefits:
● Minimizes disk usage while storing gigabytes of data
● Efficient retrieval and deserialization of large amounts of columnar data
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
16. Toolkit for a modern analytic system (OLAP database)?
match tool_needed {
Productive Systems Language => Rust,
File format (persistence) => Parquet,
Columnar memory representation => Arrow Arrays,
Operations (e.g. multiply, avg) => Compute Kernels,
SQL + extensible query engine => Arrow DataFusion,
Network transfer => Arrow Flight RPC,
JDBC/ODBC driver => Arrow FlightSQL,
}
18. Kafka vs InfluxDB
• Data Model: Kafka is primarily a real-time, distributed streaming platform
that handles messages, whereas InfluxDB is a time-series database
designed for fast read and write performance for time-stamped data.
• Use Cases: Kafka is used for building real-time data pipelines, stream
processing, and real-time analytics. InfluxDB is optimized for monitoring,
event data storage, and real-time analytics specifically on time-series data.
• Data Durability and Scalability: Kafka distributes data across a cluster of
servers for high availability and fault-tolerance. InfluxDB also provides
clustering but is more focused on high-speed read and write operations.
• Query Language: Kafka does not have a native query language; data is often consumed using stream processing frameworks like Kafka Streams or Apache Flink. InfluxDB uses its own query languages, InfluxQL and SQL, to interact with data.
20. Telegraf
● The open-source agent for collecting metrics
● Driven by the community (600+ contributors)
● Simple to configure, extremely flexible
21. Categories of Telegraf Plugins
Categories: Logging, Databases, Networking, Industrial IoT, Web/Streaming, Gaming/Entertainment, Consumer IoT, Containers, Cloud
Example plugins: AMQP, MQTT, KNX, OPC-UA, Modbus, InfluxDB Listener, MySQL, Cassandra, Docker, Kubernetes, Podman, NGINX, Apache, Cloudflare, Syslog, File/Tail, OpenTelemetry, AWS CloudWatch, Google Cloud Pub/Sub, NVIDIA SMI, AMD, SNMP, gNMI, Minecraft, CS:GO, Jolokia, TCP/UDP Listener, Elasticsearch, MongoDB, Kafka, HiveMQ
22. Kafka vs Telegraf
• Core Functionality: Kafka is a distributed streaming platform used for building
real-time data pipelines and streaming applications. Telegraf is an agent for
collecting and reporting metrics and data, usually feeding that data into InfluxDB or
other data stores.
• Data Flow: Kafka generally acts as a broker for streaming data between producers
and consumers, enabling real-time processing and analysis. Telegraf is primarily
focused on gathering metrics from a variety of sources and sending them to
specified output plugins like databases.
• Configuration: Kafka is often configured through a broker-based setup and can
involve Zookeeper for older versions. Telegraf is typically easier to set up and
configure, often through a single configuration file where you specify input and
output plugins.
• Use Cases: Kafka is used for a broad set of real-time analytics use cases, including
activity tracking, aggregating data from different sources, or acting as a buffer to
handle bursts in data loads. Telegraf is commonly used for collecting performance
metrics from systems, databases, or applications and sending them to time-series
databases like InfluxDB for monitoring.
23. Kafka Consumer Input Plugin
The Kafka consumer plugin in this context
reads data from Kafka and generates metrics
using supported input data formats.
• This plugin operates as a service input,
which means it continuously listens and
waits for metrics or events to be generated
• There are global and plugin configuration
options available which allow you to
modify metrics, tags, create aliases,
configure ordering, and more.
• There are secret-stores to handle sensitive
information like sasl_username,
sasl_password, and sasl_access_token.
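As a sketch of how those options fit together, a kafka_consumer block using secret-store references for the SASL credentials might look like this (the broker address, topic, store id, and key names are all placeholders, not from the talk):

```toml
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["example_topic"]
  ## Credentials resolved from a configured Telegraf secret-store
  ## instead of being stored in the config file in plain text.
  sasl_username = "@{my_store:kafka_user}"
  sasl_password = "@{my_store:kafka_pass}"
  ## Parse incoming messages as InfluxDB line protocol.
  data_format = "influx"
```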
24. Options for Data Formats
Avro, Binary, Collectd, CSV, Dropwizard, Graphite, Grok, InfluxDB Line Protocol, JSON, JSON v2, Logfmt, Nagios, Prometheus, Prometheus Remote Write, Value (e.g. 45 or "booyah"), Wavefront, XPath (supports XML, JSON, MessagePack, Protocol Buffers)
25. Kafka Output Plugin
The Kafka output plugin acts as a Kafka producer for writing data to Kafka brokers.
• It supports global and plugin-specific configuration settings for modifying metrics, tags, and fields.
• The plugin provides secret-store support, allowing secure handling of options like sasl_username, sasl_password, and sasl_access_token.
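A minimal outputs.kafka block, sketched from Telegraf's documented options (the broker address and topic name here are placeholders):

```toml
[[outputs.kafka]]
  ## Kafka brokers to publish to.
  brokers = ["kafka:9092"]
  ## Topic the produced metrics are written to.
  topic = "telegraf_metrics"
  ## Serialize outgoing metrics as InfluxDB line protocol.
  data_format = "influx"
```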
30. Repo overview
● app/garden_sensor_gateway.py: a Python script that uses the KafkaProducer class from the kafka package to send generated garden sensor data to a Kafka topic. It includes random humidity, temperature, wind, and soil data.
● app/Dockerfile: creates a container that runs garden_sensor_gateway.py.
● resources/docker-compose: creates the kafka, zookeeper, telegraf, and garden_sensor_gateway containers.
● resources/mytelegraf.conf: contains the Telegraf configuration to subscribe to the Kafka topic and write the garden data to InfluxDB Cloud v3.
31. mytelegraf.conf
[[inputs.kafka_consumer]]
## Kafka brokers.
brokers = ["kafka:9092"]
## Topics to consume.
topics = ["garden_sensor_data"]
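The block above only names the brokers and topics. Because the garden gateway publishes JSON payloads, the consumer also needs a parser; a likely addition (an assumption based on Telegraf's standard data_format option, not shown on the slide) is:

```toml
[[inputs.kafka_consumer]]
  brokers = ["kafka:9092"]
  topics = ["garden_sensor_data"]
  ## Parse the JSON messages produced by garden_sensor_gateway.py.
  data_format = "json"
```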
32. mytelegraf.conf
[[outputs.influxdb_v2]]
## The URLs of the InfluxDB cluster nodes.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
## ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
urls = ["https://us-east-1-1.aws.cloud2.influxdata.com/"]
## API token for authentication.
token = "$INFLUX_TOKEN"
## Organization is the name of the organization you wish to write to; must exist.
organization = "$INFLUX_ORG"
## Destination bucket to write into.
bucket = "$INFLUX_BUCKET"
33. garden_sensor_gateway.py
● Data Generation: Functions like random_temp_cels, random_humidity,
random_wind, and random_soil generate random sensor data for temperature, humidity,
wind speed, and soil moisture, respectively. These values are rounded to one decimal place.
● JSON Packaging: The get_json_data function aggregates these random sensor
readings into a JSON-formatted string. The keys are the types of readings ("temperature,"
"humidity," etc.), and the values are the generated random numbers.
● Kafka Producer: In the main function, a KafkaProducer is initialized to send messages to
a Kafka broker running on kafka:9092. The producer sends 20,000 messages to a topic
called 'garden_sensor_data'.
● Message Sending Loop: The loop in main continuously creates JSON-formatted sensor
data with get_json_data, sends it to the Kafka topic, and prints a message to the
console. The loop pauses for 5 seconds between each iteration.
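Putting those four descriptions together, the script might look roughly like the following sketch (the function names, broker address, and topic come from this slide; the kafka-python client usage and the exact value ranges are assumptions):

```python
import json
import random
import time

def random_temp_cels():
    return round(random.uniform(-10.0, 40.0), 1)  # range is an assumption

def random_humidity():
    return round(random.uniform(0.0, 100.0), 1)

def random_wind():
    return round(random.uniform(0.0, 10.0), 1)

def random_soil():
    return round(random.uniform(0.0, 100.0), 1)

def get_json_data():
    # Aggregate one reading of each sensor into a JSON-formatted string.
    return json.dumps({
        "temperature": random_temp_cels(),
        "humidity": random_humidity(),
        "wind": random_wind(),
        "soil": random_soil(),
    })

def main():
    # kafka-python client; imported here so the data helpers work standalone.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    for _ in range(20000):
        payload = get_json_data()
        producer.send("garden_sensor_data", bytes(payload, "utf-8"))
        print(f"Sent: {payload}")
        time.sleep(5)  # pause between readings, per the slide

if __name__ == "__main__":
    main()
```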
36. Confluent options V1
V1 Sink:
● At least once delivery
● Dead letter queue
● Multiple tasks
V1 Source:
● Adds measurements from the database dynamically
● Whitelists and blacklists
● Varying polling intervals
37. Confluent options V2
At least once delivery: This connector guarantees that records from the Kafka topic are delivered at least once.
Supports multiple tasks: The connector supports running one or more tasks. More tasks may improve performance.
38. What about V3? Hopefully coming soon!
Our OSS V3 will be coming at the end of this year, and we hope there will be a new Confluent V3 sink and source in Q1 of next year. For now, the V1 source and V2 sink work with V3.
39. A blog covering InfluxDB + Confluent integration, plus the same with MQTT, comparing the two streaming options for an IoT use case. Includes code examples.
41. Hulu reduced their tangled Celtic-knot stream to a stable, fast, and durable stream with Kafka and InfluxDB, solving the problems that come with traditional publish-subscribe models. Hulu used Kafka and InfluxDB to scale to over 1M metrics per second.
42. In Wayfair’s architecture, Kafka is sandwiched between Telegraf agents. An output-Kafka Telegraf agent pipes metrics from their application to Kafka, and then the Kafka-consumer Telegraf agent collects those metrics from Kafka and sends them to InfluxDB. This model enables Wayfair to connect to multiple data centers, inject processing hooks ad hoc, and gain multi-day tolerance against severe outages.
44. Configure Grafana Data Source
Select data sources.
45. Configure Grafana Data Source
TOP TIP! Make sure to remove the protocol from the URL and add the port.
TOP TIP! Make sure your token has read permissions enabled!
46. Basic Visualization
Navigate to transform and partition values; select Time Series.
SELECT time, fuel, "generatorID"
FROM "genData"
WHERE $__timeRange(time)
TOP TIP! $__timeRange(time) is a Grafana global variable allowing you to make your time ranges dynamic. There are others you can discover via the query helper.
48. Try it yourself
https://github.com/InfluxCommunity/InfluxDB-IOx-Quick-Starts
50. Project QR codes
● Kafka input plugin
● Kafka output plugin
● Kafka + Telegraf project
● Confluent project
51. Try It Yourself
https://www.influxdata.com https://github.com/InfluxCommunity
52. Further Resources
Get started: influxdata.com/cloud
Slack: influxcommunity.slack.com
GitHub: github.com/InfluxCommunity
Docs: docs.influxdata.com
Blogs: influxdata.com/blog
InfluxDB University: influxdata.com/university