1. | © Copyright 2022, InfluxData
Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Comprehensive Guide
Zoe Steinkamp
2. Zoe Steinkamp
Developer Advocate - InfluxData
LinkedIn
3. Agenda
● Intro to time series databases
● Introduction to InfluxDB V3
● Introduction to Kafka + Telegraf
● Company Examples
● How Confluent supports InfluxDB
● Resources
4. Why a time series database is important
5. Time series data
● Where - from sources: networks, infrastructure & applications
● When - based on time: hours, minutes, seconds & nanoseconds
6. Types of time series data
● Metrics - quantitative values collected regularly over time
● Events - state changes or values generated irregularly over time
● Traces - complete event or request propagation in a distributed system
7. Rise of time series as a category
Time series is the fastest growing data category by far (source: DB-Engines).
● TIME SERIES - events, metrics, time-stamped data; for IoT, analytics, cloud native
● RELATIONAL - orders, customers, records
● DOCUMENT - high throughput, large documents
● SEARCH - distributed search, logs, geo
8. Use Cases - Time Series
Time series databases address a few major issues:
● Ingest - a high volume of data streaming in at nanosecond precision
● Compression - the ability to store this large data set without breaking the bank
● Cardinality - the need to store wide rows of timestamped data with multiple values
● Querying on time - querying on time instead of indexes or values
Example use cases: warehouse IoT devices; tracking and monitoring website infrastructure; metrics from everywhere; robotics and green energy.
9. Data Storage in Time-Series DBs
Line protocol syntax:
<measurement>[,<tag-key>=<tag-value>] [<field-key>=<field-value>] [unix-nano-timestamp]
● Measurement: server
● Tag set: hostname=server02,us_west=az
● Field set: cpu=24.5,mem=12.4
● Timestamp: 1234567890000000
Reference: https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/
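As a quick illustration of the syntax above, the example point can be assembled with a small helper (a sketch only; it skips the escaping rules that full line protocol requires):

```python
def to_line_protocol(measurement, tags, fields, timestamp):
    """Assemble an InfluxDB line protocol string: measurement,tags fields timestamp."""
    tag_set = ",".join(f"{k}={v}" for k, v in tags.items())
    field_set = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_set} {field_set} {timestamp}"

line = to_line_protocol(
    "server",
    {"hostname": "server02", "us_west": "az"},
    {"cpu": 24.5, "mem": 12.4},
    1234567890000000,
)
print(line)  # server,hostname=server02,us_west=az cpu=24.5,mem=12.4 1234567890000000
```

Note the shape: tags and fields are comma-separated key=value pairs, and the three sections are separated by single spaces.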
11. Typical Architecture & Deployment
12. InfluxDB 3.0
● Schema on write
● Write and query millions of rows per second
● Single datastore for all time series data (metrics, logs & traces)
● SQL support
13. Apache Arrow is…
“Apache Arrow is a framework for defining in-memory columnar data that every processing engine can use.”
● Language-agnostic standard for columnar memory
● Efficient for running large analytical workloads on modern CPU and GPU architectures
15. Apache Parquet…
“Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.”
What is the difference between Apache Arrow and Apache Parquet?
● Parquet is not a runtime in-memory format
● Parquet data cannot be directly operated on, but must be decoded in large chunks
The benefits:
● Minimizes disk usage while storing gigabytes of data
● Efficient retrieval and deserialization of large amounts of columnar data
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
16. Toolkit for a modern analytic system (OLAP database)?
match tool_needed {
Productive Systems Language => Rust,
File format (persistence) => Parquet,
Columnar memory representation => Arrow Arrays,
Operations (e.g. multiply, avg) => Compute Kernels,
SQL + extensible query engine => Arrow DataFusion,
Network transfer => Arrow Flight RPC,
JDBC/ODBC driver => Arrow FlightSQL,
}
18. Kafka vs InfluxDB
• Data Model: Kafka is primarily a real-time, distributed streaming platform
that handles messages, whereas InfluxDB is a time-series database
designed for fast read and write performance for time-stamped data.
• Use Cases: Kafka is used for building real-time data pipelines, stream
processing, and real-time analytics. InfluxDB is optimized for monitoring,
event data storage, and real-time analytics specifically on time-series data.
• Data Durability and Scalability: Kafka distributes data across a cluster of
servers for high availability and fault-tolerance. InfluxDB also provides
clustering but is more focused on high-speed read and write operations.
• Query Language: Kafka does not have a native query language; data is often consumed using stream processing frameworks like Kafka Streams or Apache Flink. InfluxDB uses its own query languages, InfluxQL and SQL, to interact with data.
20. Telegraf
● The open-source agent for collecting metrics
● Driven by the community (600+ contributors)
● Simple to configure, extremely flexible
21. Categories of Telegraf Plugins
Categories: Logging, Databases, Networking, Industrial IoT, Web/Streaming, Gaming/Entertainment, Consumer IoT, Containers, Cloud
Example plugins: AMQP, MQTT, KNX, OPC-UA, Modbus, InfluxDB Listener, MySQL, Cassandra, Docker, Kubernetes, Podman, NGINX, Apache, Cloudflare, Syslog, File/Tail, OpenTelemetry, AWS CloudWatch, Google Cloud Pub/Sub, NVIDIA SMI, AMD, SNMP, gNMI, Minecraft, CS:GO, Jolokia, TCP/UDP Listener, Elasticsearch, MongoDB, Kafka, HiveMQ
22. Kafka vs Telegraf
• Core Functionality: Kafka is a distributed streaming platform used for building
real-time data pipelines and streaming applications. Telegraf is an agent for
collecting and reporting metrics and data, usually feeding that data into InfluxDB or
other data stores.
• Data Flow: Kafka generally acts as a broker for streaming data between producers
and consumers, enabling real-time processing and analysis. Telegraf is primarily
focused on gathering metrics from a variety of sources and sending them to
specified output plugins like databases.
• Configuration: Kafka is often configured through a broker-based setup and can
involve Zookeeper for older versions. Telegraf is typically easier to set up and
configure, often through a single configuration file where you specify input and
output plugins.
• Use Cases: Kafka is used for a broad set of real-time analytics use cases, including
activity tracking, aggregating data from different sources, or acting as a buffer to
handle bursts in data loads. Telegraf is commonly used for collecting performance
metrics from systems, databases, or applications and sending them to time-series
databases like InfluxDB for monitoring.
23. Kafka Consumer Input Plugin
The Kafka consumer plugin in this context
reads data from Kafka and generates metrics
using supported input data formats.
• This plugin operates as a service input,
which means it continuously listens and
waits for metrics or events to be generated
• There are global and plugin configuration
options available which allow you to
modify metrics, tags, create aliases,
configure ordering, and more.
• There are secret-stores to handle sensitive
information like sasl_username,
sasl_password, and sasl_access_token.
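As a sketch of how those options fit together, a kafka_consumer block using secret-store references for the SASL credentials might look like this (the broker address, topic, store id, and key names are all placeholders, not from the talk):

```toml
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["example_topic"]
  ## Credentials resolved from a configured Telegraf secret-store
  ## instead of being stored in the config file in plain text.
  sasl_username = "@{my_store:kafka_user}"
  sasl_password = "@{my_store:kafka_pass}"
  ## Parse incoming messages as InfluxDB line protocol.
  data_format = "influx"
```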
24. Options for Data Formats
Avro, Binary, Collectd, CSV, Dropwizard, Graphite, Grok, InfluxDB Line Protocol, JSON, JSON v2, Logfmt, Nagios, Prometheus, Prometheus Remote Write, Value (e.g. 45 or "booyah"), Wavefront, XPath (supports XML, JSON, MessagePack, Protocol Buffers)
25. Kafka Output Plugin
The Kafka output plugin acts as a Kafka producer for writing data to Kafka brokers.
• It supports global and plugin-specific configuration settings for modifying metrics, tags, and fields.
• The plugin provides secret-store support, allowing secure handling of options like sasl_username, sasl_password, and sasl_access_token.
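A minimal outputs.kafka block, sketched from Telegraf's documented options (the broker address and topic name here are placeholders):

```toml
[[outputs.kafka]]
  ## Kafka brokers to publish to.
  brokers = ["kafka:9092"]
  ## Topic the produced metrics are written to.
  topic = "telegraf_metrics"
  ## Serialize outgoing metrics as InfluxDB line protocol.
  data_format = "influx"
```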
30. Repo overview
● app/garden_sensor_gateway.py: a Python script that uses the KafkaProducer class from the kafka package to send generated garden sensor data to a Kafka topic. It includes random humidity, temperature, wind, and soil data.
● app/Dockerfile: creates a container that runs garden_sensor_gateway.py.
● resources/docker-compose: creates the kafka, zookeeper, telegraf, and garden_sensor_gateway containers.
● resources/mytelegraf.conf: contains the Telegraf configuration to subscribe to the Kafka topic and write the garden data to InfluxDB Cloud v3.
31. mytelegraf.conf
[[inputs.kafka_consumer]]
## Kafka brokers.
brokers = ["kafka:9092"]
## Topics to consume.
topics = ["garden_sensor_data"]
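The block above only names the brokers and topics. Because the garden gateway publishes JSON payloads, the consumer also needs a parser; a likely addition (an assumption based on Telegraf's standard data_format option, not shown on the slide) is:

```toml
[[inputs.kafka_consumer]]
  brokers = ["kafka:9092"]
  topics = ["garden_sensor_data"]
  ## Parse the JSON messages produced by garden_sensor_gateway.py.
  data_format = "json"
```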
32. mytelegraf.conf
[[outputs.influxdb_v2]]
## The URLs of the InfluxDB cluster nodes.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
## ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
urls = ["https://us-east-1-1.aws.cloud2.influxdata.com/"]
## API token for authentication.
token = "$INFLUX_TOKEN"
## Organization is the name of the organization you wish to write to; must exist.
organization = "$INFLUX_ORG"
## Destination bucket to write into.
bucket = "$INFLUX_BUCKET"
33. garden_sensor_gateway.py
● Data Generation: Functions like random_temp_cels, random_humidity,
random_wind, and random_soil generate random sensor data for temperature, humidity,
wind speed, and soil moisture, respectively. These values are rounded to one decimal place.
● JSON Packaging: The get_json_data function aggregates these random sensor
readings into a JSON-formatted string. The keys are the types of readings ("temperature,"
"humidity," etc.), and the values are the generated random numbers.
● Kafka Producer: In the main function, a KafkaProducer is initialized to send messages to
a Kafka broker running on kafka:9092. The producer sends 20,000 messages to a topic
called 'garden_sensor_data'.
● Message Sending Loop: The loop in main continuously creates JSON-formatted sensor
data with get_json_data, sends it to the Kafka topic, and prints a message to the
console. The loop pauses for 5 seconds between each iteration.
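Putting those four descriptions together, the script might look roughly like the following sketch (the function names, broker address, and topic come from this slide; the kafka-python client usage and the exact value ranges are assumptions):

```python
import json
import random
import time

def random_temp_cels():
    return round(random.uniform(-10.0, 40.0), 1)  # range is an assumption

def random_humidity():
    return round(random.uniform(0.0, 100.0), 1)

def random_wind():
    return round(random.uniform(0.0, 10.0), 1)

def random_soil():
    return round(random.uniform(0.0, 100.0), 1)

def get_json_data():
    # Aggregate one reading of each sensor into a JSON-formatted string.
    return json.dumps({
        "temperature": random_temp_cels(),
        "humidity": random_humidity(),
        "wind": random_wind(),
        "soil": random_soil(),
    })

def main():
    # kafka-python client; imported here so the data helpers work standalone.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    for _ in range(20000):
        payload = get_json_data()
        producer.send("garden_sensor_data", bytes(payload, "utf-8"))
        print(f"Sent: {payload}")
        time.sleep(5)  # pause between readings, per the slide

if __name__ == "__main__":
    main()
```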
36. Confluent options V1
V1 Sink:
● At least once delivery
● Dead letter queue
● Multiple tasks
V1 Source:
● Adds measurements from the database dynamically
● Whitelists and blacklists
● Varying polling intervals
37. Confluent options V2
At least once delivery: This connector guarantees that records from the Kafka topic are delivered at least once.
Supports multiple tasks: The connector supports running one or more tasks. More tasks may improve performance.
38. What about V3? Hopefully coming soon!
Our OSS V3 will be coming at the end of this year, and we hope there will be a new Confluent V3 sink and source in Q1 of next year. For now, the V1 source and V2 sink work with V3.
39. A blog covering InfluxDB + Confluent integration, plus the same with MQTT, comparing the two streaming options for an IoT use case. Includes code examples.
41. Hulu reduced their tangled Celtic-knot stream to a stable, fast, and durable stream with Kafka and InfluxDB, solving the problems that come with traditional publish-subscribe models. Hulu used Kafka and InfluxDB to scale to over 1M metrics per second.
42. In Wayfair’s architecture, Kafka is sandwiched between Telegraf agents. An output-Kafka Telegraf agent pipes metrics from their application to Kafka, and then the Kafka-consumer Telegraf agent collects those metrics from Kafka and sends them to InfluxDB. This model enables Wayfair to connect to multiple data centers, inject processing hooks ad hoc, and gain multi-day tolerance against severe outages.
44. Configure Grafana Data Source
Select data sources.
45. Configure Grafana Data Source
TOP TIP! Make sure to remove the protocol from the URL and add the port.
TOP TIP! Make sure your token has read permissions enabled!
46. Basic Visualization
Navigate to transform and partition values; select Time Series.
SELECT time, fuel, "generatorID"
FROM "genData"
WHERE $__timeRange(time)
TOP TIP! $__timeRange(time) is a Grafana global variable allowing you to make your time ranges dynamic. There are others you can discover via the query helper.
48. Try it yourself
https://github.com/InfluxCommunity/InfluxDB-IOx-Quick-Starts
50. Project QR codes
● Kafka input plugin
● Kafka output plugin
● Kafka + Telegraf project
● Confluent project
51. Try It Yourself
https://www.influxdata.com https://github.com/InfluxCommunity
52. Further Resources
Get started: influxdata.com/cloud
Slack: influxcommunity.slack.com
GitHub: github.com/InfluxCommunity
Docs: docs.influxdata.com
Blogs: influxdata.com/blog
InfluxDB University: influxdata.com/university