| © Copyright 2022, InfluxData
Maximizing Real-Time
Data Processing with
Apache Kafka and
InfluxDB: A
Comprehensive Guide
Zoe Steinkamp
Zoe Steinkamp
Developer Advocate - InfluxData
LinkedIn
Agenda
● Intro to time series databases
● Introduction to InfluxDB V3
● Introduction to Kafka + Telegraf
● Company Examples
● How Confluent supports InfluxDB
● Resources
Why a time series database is
important
Time series data
When: based on time (hours, minutes, seconds & nanoseconds)
Where: from sources (sources, networks, infrastructure & applications)
Types of time series data
● Metrics: Quantitative values collected regularly over time
● Events: State changes or values generated irregularly over time
● Traces: Complete event or request propagation in a distributed system
Rise of time series as a category
TIME SERIES
• Events, metrics, time-stamped
• For IoT, analytics, cloud native
RELATIONAL
• Orders
• Customers
• Records
DOCUMENT
• High throughput
• Large document
SEARCH
• Distributed search
• Logs
• Geo
Time series is the fastest-growing data category by far (chart: time series vs. all others).
Source: DB-Engines
Use Cases - Time Series
● Warehouse IoT Devices
● Track and monitor website infrastructure
● Metrics from everywhere
● Robotics and Green Energy
Time series databases address a few major issues:
● Ingest - A high volume of data streaming in at nanosecond precision
● Compression - The ability to store this large data set without breaking the bank
● Cardinality - The need to store wide rows of timestamped data with multiple values
● Querying on Time - Querying by time rather than by indexes or values
Data Storage in Time-Series DBs
Line protocol syntax:
<measurement>[,<tag-key>=<tag-value>] <field-key>=<field-value> [unix-nano-timestamp]
Example, element by element:
Measurement: server
Tag Set: ,hostname=server02,us_west=az
Field Set: cpu=24.5,mem=12.4
Timestamp: 1234567890000000
Reference: https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/
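To make the layout of a line protocol record concrete, here is a minimal Python sketch that assembles one line from a measurement, a tag set, a field set, and an optional timestamp. The function names (`to_line_protocol`, `_escape`, `_format_field`) are hypothetical helpers written for this illustration, not part of any InfluxDB client library, and the escaping covers only the common cases.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in `chars` (illustrative helper)."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    """Render a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None) -> str:
    """Build one line: measurement[,tags] fields [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys keep the output deterministic
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    field_set = ",".join(
        f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items()
    )
    line = f"{head} {field_set}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

Calling `to_line_protocol("server", {"hostname": "server02", "us_west": "az"}, {"cpu": 24.5, "mem": 12.4}, 1234567890000000)` reproduces the slide's example as a single line.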
InfluxDB 3.0
Typical Architecture & Deployment
InfluxDB 3.0
Schema on write
Write and query millions of rows
per second
Single datastore for all time series
data (Metrics, Logs & Traces)
SQL Support
Apache Arrow is…
“Apache Arrow is a framework for defining in-memory columnar data that every
processing engine can use.”
● Language-agnostic standard for
columnar memory
● Efficient for running large analytical
workloads on modern CPU and
GPU architectures.
Defragmenting Data Access Across Systems
Apache Parquet…
“Apache Parquet is an open source, column-oriented data file format designed
for efficient data storage and retrieval.”
The benefits
● Minimize disk usage while storing gigabytes of data
● Efficient retrieval and deserialization of large amounts of columnar data
What is the difference between Apache Arrow and Apache Parquet?
● Parquet is not a runtime in-memory format
● Parquet data cannot be directly operated on but must be decoded in large chunks
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
Toolkit for a modern analytic system (OLAP database)?
match tool_needed {
Productive Systems Language => Rust,
File format (persistence) => Parquet,
Columnar memory representation => Arrow Arrays,
Operations (e.g. multiply, avg) => Compute Kernels,
SQL + extensible query engine => Arrow DataFusion,
Network transfer => Arrow Flight RPC,
JDBC/ODBC driver => Arrow FlightSQL,
}
InfluxDB vs Kafka
Kafka vs InfluxDB
• Data Model: Kafka is primarily a real-time, distributed streaming platform
that handles messages, whereas InfluxDB is a time-series database
designed for fast read and write performance for time-stamped data.
• Use Cases: Kafka is used for building real-time data pipelines, stream
processing, and real-time analytics. InfluxDB is optimized for monitoring,
event data storage, and real-time analytics specifically on time-series data.
• Data Durability and Scalability: Kafka distributes data across a cluster of
servers for high availability and fault-tolerance. InfluxDB also provides
clustering but is more focused on high-speed read and write operations.
• Query Language: Kafka does not have a native query language; data is
often consumed using stream processing frameworks like Kafka Streams or
Apache Flink. InfluxDB is queried with InfluxQL or SQL.
Telegraf + Kafka
Telegraf
The open-source
agent for
collecting metrics
Driven by the
community (600+
contributors)
Simple to
configure,
extremely flexible
Categories of Telegraf Plugins
Categories: Logging, Databases, Networking, Industrial IoT, Web/Streaming, Gaming / Entertainment, Consumer IoT, Containers, Cloud
Example plugins: AMQP, MQTT, KNX, OPC-UA, Modbus, InfluxDB Listener, MySQL, Cassandra, Docker, Kubernetes, Podman, NGINX, Apache, Cloudflare, Syslog, File/Tail, OpenTelemetry, AWS CloudWatch, Google Cloud Pub/Sub, NVIDIA SMI, AMD, SNMP, GNMI, Minecraft, CS:GO, Jolokia, TCP/UDP Listener, Elasticsearch, MongoDB, Kafka, HiveMQ
Kafka vs Telegraf
• Core Functionality: Kafka is a distributed streaming platform used for building
real-time data pipelines and streaming applications. Telegraf is an agent for
collecting and reporting metrics and data, usually feeding that data into InfluxDB or
other data stores.
• Data Flow: Kafka generally acts as a broker for streaming data between producers
and consumers, enabling real-time processing and analysis. Telegraf is primarily
focused on gathering metrics from a variety of sources and sending them to
specified output plugins like databases.
• Configuration: Kafka is configured per broker and, in older versions, also
requires ZooKeeper. Telegraf is typically easier to set up and
configure, often through a single configuration file where you specify input and
output plugins.
• Use Cases: Kafka is used for a broad set of real-time analytics use cases, including
activity tracking, aggregating data from different sources, or acting as a buffer to
handle bursts in data loads. Telegraf is commonly used for collecting performance
metrics from systems, databases, or applications and sending them to time-series
databases like InfluxDB for monitoring.
Kafka Consumer Input Plugin
The Kafka consumer plugin in this context
reads data from Kafka and generates metrics
using supported input data formats.
• This plugin operates as a service input,
which means it continuously listens and
waits for metrics or events to be generated
• There are global and plugin configuration
options available which allow you to
modify metrics, tags, create aliases,
configure ordering, and more.
• There are secret-stores to handle sensitive
information like sasl_username,
sasl_password, and sasl_access_token.
Options for Data Formats
Avro
Binary
Collectd
CSV
Dropwizard
Graphite
Grok
InfluxDB Line Protocol
JSON
JSON v2
Logfmt
Nagios
Prometheus
Prometheus Remote Write
Value, i.e. 45 or "booyah"
Wavefront
XPath (supports XML, JSON, MessagePack, Protocol Buffers)
Kafka Output Plugin
Kafka Output Plugin acts as a Kafka
Producer for writing data to Kafka Brokers.
• It supports global and plugin-specific
configuration settings for modifying
metrics, tags, and fields.
• The plugin provides secret-store
support, allowing secure handling of
options like sasl_username,
sasl_password, and
sasl_access_token.
Kafka Outputs Example
Telegraf + Kafka Project
Project Outline
Repo overview
● app/garden_sensor_gateway.py: a Python script that uses the KafkaProducer class from
the kafka package to send generated garden sensor data to a Kafka topic. The data includes
random humidity, temperature, wind, and soil readings.
● app/Dockerfile: creates a container that runs garden_sensor_gateway.py
● resources/docker-compose: creates the kafka, zookeeper, telegraf, and
garden_sensor_gateway containers
● resources/mytelegraf.conf: contains the Telegraf configuration to subscribe to the
Kafka topic and write the garden data to InfluxDB Cloud v3.
mytelegraf.conf
[[inputs.kafka_consumer]]
## Kafka brokers.
brokers = ["kafka:9092"]
## Topics to consume.
topics = ["garden_sensor_data"]
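As a rough picture of what the consumer does with each message on the topic, the Python sketch below turns one JSON payload into a metric with a measurement name, tags, and numeric fields. It assumes the consumer is set to parse JSON (the data format setting is not shown on the slide). The default measurement name `kafka_consumer` and the rule that only numbers become fields are assumptions about Telegraf's defaults. `json_message_to_metric` is a hypothetical name, and nested objects are not handled.

```python
import json


def json_message_to_metric(payload: bytes, measurement="kafka_consumer", tags=None):
    """Convert one flat JSON message into a metric-shaped dict (illustration only)."""
    doc = json.loads(payload)
    fields = {}
    for key, value in doc.items():
        if isinstance(value, bool):
            continue  # assumption: booleans are skipped unless configured otherwise
        if isinstance(value, (int, float)):
            fields[key] = float(value)  # assumption: all numbers become floats
        # assumption: strings are dropped unless mapped to tags or string fields
    return {"measurement": measurement, "tags": dict(tags or {}), "fields": fields}
```

A message such as `{"temperature": 21.5, "humidity": 40}` would then yield two float fields under a single measurement, which the output plugin writes to InfluxDB.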
mytelegraf.conf
[[outputs.influxdb_v2]]
## The URLs of the InfluxDB cluster nodes.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
## ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
urls = ["https://us-east-1-1.aws.cloud2.influxdata.com/"]
## API token for authentication.
token = "$INFLUX_TOKEN"
## Organization is the name of the organization you wish to write to; must exist.
organization = "$INFLUX_ORG"
## Destination bucket to write into.
bucket = "$INFLUX_BUCKET"
garden_sensor_gateway.py
● Data Generation: Functions like random_temp_cels, random_humidity,
random_wind, and random_soil generate random sensor data for temperature, humidity,
wind speed, and soil moisture, respectively. These values are rounded to one decimal place.
● JSON Packaging: The get_json_data function aggregates these random sensor
readings into a JSON-formatted string. The keys are the types of readings ("temperature,"
"humidity," etc.), and the values are the generated random numbers.
● Kafka Producer: In the main function, a KafkaProducer is initialized to send messages to
a Kafka broker running on kafka:9092. The producer sends 20,000 messages to a topic
called 'garden_sensor_data'.
● Message Sending Loop: The loop in main continuously creates JSON-formatted sensor
data with get_json_data, sends it to the Kafka topic, and prints a message to the
console. The loop pauses for 5 seconds between each iteration.
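The description above can be sketched in Python as follows. This is not the repository's script: the value ranges, the `wind` and `soil` key names, and the `run` helper with its injectable `send` callable are assumptions made so the loop can be exercised without a broker. `main` shows how a `KafkaProducer` from the kafka-python package could be plugged in.

```python
import json
import random
import time

TOPIC = "garden_sensor_data"


def random_temp_cels() -> float:
    return round(random.uniform(-10.0, 50.0), 1)  # assumed range, degrees Celsius


def random_humidity() -> float:
    return round(random.uniform(0.0, 100.0), 1)  # assumed range, percent


def random_wind() -> float:
    return round(random.uniform(0.0, 10.0), 1)  # assumed range


def random_soil() -> float:
    return round(random.uniform(0.0, 100.0), 1)  # assumed range


def get_json_data() -> str:
    """Package one set of random readings as a JSON string."""
    return json.dumps({
        "temperature": random_temp_cels(),
        "humidity": random_humidity(),
        "wind": random_wind(),  # key name is an assumption
        "soil": random_soil(),  # key name is an assumption
    })


def run(send, count=20000, delay_s=5.0) -> None:
    """Send `count` messages through `send(topic, value_bytes)`, pausing between each."""
    for i in range(count):
        payload = get_json_data()
        send(TOPIC, payload.encode("utf-8"))
        print(f"sent message {i + 1}: {payload}")
        time.sleep(delay_s)


def main() -> None:
    # Imported here so the rest of the sketch runs without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    run(lambda topic, value: producer.send(topic, value=value))
    producer.flush()  # make sure buffered messages are delivered before exiting
```

Passing a plain function as `send` makes it easy to dry-run the generator, for example `run(lambda t, v: None, count=3, delay_s=0)`.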
Confluent + InfluxDB
Confluent options V1
V1 Sink
At least once delivery
Dead Letter Queue
Multiple tasks
V1 Source
adding measurements from the
database dynamically
whitelists and blacklists
varying polling intervals
Confluent options V2
At least once delivery: This
connector guarantees that
records from the Kafka topic are
delivered at least once.
Supports multiple tasks: The
connector supports running one
or more tasks. More tasks may
improve performance.
What about V3? Hopefully coming soon!
Our OSS V3 is coming at the end of this year, and we
hope there will be a new Confluent V3 sink and source in
Q1 of next year. For now, the V1 source and V2 sink
work with V3.
A blog post covering the InfluxDB + Confluent integration, the same
integration with MQTT, and a comparison of the two streaming options
for an IoT use case. Includes code examples.
Company Examples
Hulu reduced their
tangled celtic knot
stream to a stable,
fast, and durable
stream with Kafka
and InfluxDB and
solved the problems
that come with
Traditional
Publish-Subscribe
models. Hulu used
Kafka and InfluxDB to
scale to over 1M
metrics per second.
In Wayfair’s architecture,
Kafka is sandwiched
between Telegraf agents. An
Output Kafka Telegraf agent
pipes metrics from their
application to Kafka and
then the Kafka-Consumer
Telegraf agent collects those
metrics from Kafka and
sends them to InfluxDB. This
model enables Wayfair to
connect to multiple data
centers, inject processing
hooks ad hoc, and gain
multi-day tolerance against
severe outages.
Grafana + InfluxDB
Select Data
sources
Configure Grafana Data Source
Configure Grafana Data Source
TOP TIP!
Make sure to remove
the protocol from the
URL and add the port.
TOP TIP!
Make sure your token
has read permissions
enabled!
Basic Visualization
Navigate to transform and partition values.
Select Time Series.
SELECT time, fuel, "generatorID"
FROM "genData"
WHERE $__timeRange(time)
TOP TIP!
$__timeRange(time) is a Grafana macro that expands to a
filter on the dashboard's selected time range, so your
queries follow the time picker.
There are others you can
discover via the query helper.
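To show what happens to `$__timeRange(time)` before the query is sent, here is a small Python sketch of the substitution. It is not Grafana's implementation: `expand_time_range` is a hypothetical function, and the exact SQL text and timestamp format Grafana generates are assumptions. The point is only that the placeholder becomes a lower and upper bound on the named column.

```python
import re
from datetime import datetime, timezone

# Matches $__timeRange(<column>) and captures the column expression.
_TIME_RANGE = re.compile(r"\$__timeRange\(([^)]+)\)")


def expand_time_range(sql: str, start: datetime, end: datetime) -> str:
    """Replace each $__timeRange(col) with an explicit range filter on col."""
    def _replace(match: re.Match) -> str:
        col = match.group(1).strip()
        return f"{col} >= '{start.isoformat()}' AND {col} <= '{end.isoformat()}'"

    return _TIME_RANGE.sub(_replace, sql)


query = 'SELECT time, fuel, "generatorID" FROM "genData" WHERE $__timeRange(time)'
expanded = expand_time_range(
    query,
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 1, 2, tzinfo=timezone.utc),
)
```

Changing the dashboard's time picker corresponds to calling the function with a different `start` and `end`, which is why the query text itself never needs to change.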
Try it yourself
https://github.com/InfluxCommunity/InfluxDB-IOx-Quick-Starts
Resources
Project QR codes
Kafka input plugin
Kafka output plugin
Kafka + Telegraf Project
Confluent Project
Try It Yourself
https://www.influxdata.com https://github.com/InfluxCommunity
Further Resources
Get started: influxdata.com/cloud
Slack: influxcommunity.slack.com
GitHub: github.com/InfluxCommunity
Docs: docs.influxdata.com
Blogs: influxdata.com/blog
InfluxDB University: influxdata.com/university
THANK YOU

Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Comprehensive Guide

  • 1. | © Copyright 2022, InfluxData 1 Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Comprehensive Guide Zoe Steinkamp
  • 2. | © Copyright 2022, InfluxData 2 Zoe Steinkamp Developer Advocate - InfluxData 2 LinkedIn
  • 3. | © Copyright 2022, InfluxData 3 | © Copyright 2022, InfluxData 3 Agenda ● Intro to time series databases ● Introduction to InfluxDB V3 ● Intoduction to Kafka + Telegraf ● Company Examples ● How Confluent supports InfluxDB ● Resources
  • 4. | © Copyright 2022, InfluxData 4 Why a time series database is important 4
  • 5. | © Copyright 2023, InfluxData 5 Time series data 5 When Where from sources Sources, networks, infrastructure & applications Based on time Hours, minutes, seconds & nanoseconds
  • 6. | © Copyright 2023, InfluxData 6 Types of time series data Quantitative values collected regularly over time State changes or values generated irregularly over time Complete event or request propagation in a distributed system Metrics Events Traces
  • 7. | © Copyright 2023, InfluxData 7 Rise of time series as a category 7 TIME SERIES RELATIONAL DOCUMENT SEARCH • Distributed search • Logs • Geo • High throughput • Large document • Orders • Customers • Records • Events, metrics, time-stamped • For IoT, analytics, cloud native Time series is fastest growing data category by far Time series All others source: DB Engines
  • 8. | © Copyright 2022, InfluxData 8 Use Cases - Time Series Warehouse IoT Devices Time series databases address a few major issues. Ingest - A high amount of data streaming in at nano second precision Compression - The ability to store this large data set without breaking the bank Cardinality - The need to store wide rows, timestamped data with multiple values Querying on Time - Instead of indexes or values, querying on time Track and monitor website infrastructure Metrics from everywhere Robotics and Green Energy
  • 9. | © Copyright 2023, InfluxData 9 Reference: https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/ <measurement>[,<tag-key>=<tag-value>] [<field-key>=<field-value>] [unix-nano-timestamp] Reference: https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/ Tag Set ,hostname=server02,us_west=az Measurement server Field Set cpu=24.5,mem=12.4 Timestamp 1234567890000000 Data Storage in Time-Series DB’s
  • 10. InfluxDB 3.0
  • 11. Typical Architecture & Deployment
  • 12. InfluxDB 3.0: Schema on write. Write and query millions of rows per second. Single datastore for all time series data (Metrics, Logs & Traces). SQL Support.
  • 13. Apache Arrow is… “Apache Arrow is a framework for defining in-memory columnar data that every processing engine can use.” ● Language-agnostic standard for columnar memory ● Efficient for running large analytical workloads on modern CPU and GPU architectures.
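To make "columnar memory" concrete, here is a small sketch in plain Python. It does not use Arrow itself: the standard-library `array` type stands in for a contiguous typed column, and the sample rows and the `to_columns` helper are hypothetical.

```python
from array import array

# Row-oriented layout: each record keeps all of its values together.
rows = [
    {"time": 1, "cpu": 24.5, "mem": 12.4},
    {"time": 2, "cpu": 30.0, "mem": 12.9},
    {"time": 3, "cpu": 27.5, "mem": 13.1},
]


def to_columns(records):
    """Pivot row dictionaries into one contiguous typed array per column."""
    columns = {}
    for name, first_value in records[0].items():
        # "q" is a signed 64-bit integer, "d" is a 64-bit float.
        typecode = "q" if isinstance(first_value, int) else "d"
        columns[name] = array(typecode, (record[name] for record in records))
    return columns


columns = to_columns(rows)

# An aggregate over one column scans a single contiguous buffer
# and never touches the other columns.
avg_cpu = sum(columns["cpu"]) / len(columns["cpu"])
print(columns["time"].tolist(), avg_cpu)
```

The point of the layout is that an analytical operation such as an average reads only the column it needs, which is the property the slide attributes to Arrow's in-memory format.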
  • 14. Defragmenting Data Access Across Systems
  • 15. Apache Parquet… “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.” What is the difference between Apache Arrow and Apache Parquet? ● Not a runtime in-memory format ● Parquet data cannot be directly operated on but must be decoded in large chunks. The benefits: ● Minimize disk usage while storing gigabytes of data ● Efficient retrieval and deserialization of large amounts of columnar data. Source: https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
  • 16. Toolkit for a modern analytic system (OLAP database)? match tool_needed { Productive Systems Language => Rust, File format (persistence) => Parquet, Columnar memory representation => Arrow Arrays, Operations (e.g. multiply, avg) => Compute Kernels, SQL + extensible query engine => Arrow DataFusion, Network transfer => Arrow Flight RPC, JDBC/ODBC driver => Arrow FlightSQL, }
  • 17. InfluxDB vs Kafka
  • 18. Kafka vs InfluxDB • Data Model: Kafka is primarily a real-time, distributed streaming platform that handles messages, whereas InfluxDB is a time-series database designed for fast reads and writes of time-stamped data. • Use Cases: Kafka is used for building real-time data pipelines, stream processing, and real-time analytics. InfluxDB is optimized for monitoring, event data storage, and real-time analytics specifically on time-series data. • Data Durability and Scalability: Kafka distributes data across a cluster of servers for high availability and fault tolerance. InfluxDB also provides clustering but is more focused on high-speed read and write operations. • Query Language: Kafka does not have a native query language; data is typically consumed with stream processing frameworks such as Kafka Streams or Apache Flink. InfluxDB is queried with InfluxQL or SQL.
  • 19. Telegraf + Kafka
  • 20. Telegraf: The open-source agent for collecting metrics. Driven by the community (600+ contributors). Simple to configure, extremely flexible.
  • 21. Categories of Telegraf Plugins: Logging, Databases, Networking, Industrial IoT, Web/Streaming, Gaming / Entertainment, Consumer IoT, Containers, Cloud. Example plugins: AMQP, MQTT, KNX, OPC-UA, Modbus, InfluxDB Listener, MySQL, Cassandra, Docker, Kubernetes, Podman, NGINX, Apache, Cloudflare, Syslog, File/Tail, OpenTelemetry, AWS Cloudwatch, Google Cloud Pub Sub, NVIDIA SMI, AMD, SNMP, GNMI, Minecraft, CS:GO, Jolokia, TCP/UDP Listener, Elasticsearch, MongoDB, Kafka, HiveMQ.
  • 22. Kafka vs Telegraf • Core Functionality: Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. Telegraf is an agent for collecting and reporting metrics and data, usually feeding that data into InfluxDB or other data stores. • Data Flow: Kafka generally acts as a broker for streaming data between producers and consumers, enabling real-time processing and analysis. Telegraf is primarily focused on gathering metrics from a variety of sources and sending them to the configured output plugins, such as databases. • Configuration: Kafka is often configured through a broker-based setup, and older versions can also involve ZooKeeper. Telegraf is typically easier to set up and configure, often through a single configuration file where you specify input and output plugins. • Use Cases: Kafka is used for a broad set of real-time analytics use cases, including activity tracking, aggregating data from different sources, or acting as a buffer to handle bursts in data loads. Telegraf is commonly used for collecting performance metrics from systems, databases, or applications and sending them to time-series databases like InfluxDB for monitoring.
  • 23. Kafka Consumer Input Plugin. The Kafka consumer plugin reads data from Kafka and generates metrics using any of the supported input data formats. • This plugin operates as a service input, which means it continuously listens and waits for metrics or events to arrive. • Global and plugin-level configuration options let you modify metrics and tags, create aliases, configure ordering, and more. • Secret stores can hold sensitive options such as sasl_username, sasl_password, and sasl_access_token.
  • 24. Options for Data Formats: Avro, Binary, Collectd, CSV, Dropwizard, Graphite, Grok, InfluxDB Line Protocol, JSON, JSON v2, Logfmt, Nagios, Prometheus, Prometheus Remote Write, Value (i.e. 45 or "booyah"), Wavefront, XPath (supports XML, JSON, MessagePack, Protocol Buffers).
  • 25. Kafka Output Plugin. The Kafka output plugin acts as a Kafka producer, writing data to Kafka brokers. • It supports global and plugin-specific configuration settings for modifying metrics, tags, and fields. • The plugin provides secret-store support, allowing secure handling of options like sasl_username, sasl_password, and sasl_access_token.
  • 26. Kafka Outputs Example
  • 27. Telegraf + Kafka Project
  • 28. Project Outline
  • 30. Repo overview ● app/garden_sensor_gateway.py: a Python script that uses the KafkaProducer class from the kafka package to send generated garden sensor data to a Kafka topic. The data includes random humidity, temperature, wind, and soil readings. ● app/Dockerfile: creates a container that runs garden_sensor_gateway.py ● resources/docker-compose: creates the kafka, zookeeper, telegraf, and garden_sensor_gateway containers ● resources/mytelegraf.conf: contains the Telegraf configuration that subscribes to the Kafka topic and writes the garden data to InfluxDB Cloud v3.
  • 31. mytelegraf.conf
      [[inputs.kafka_consumer]]
        ## Kafka brokers.
        brokers = ["kafka:9092"]
        ## Topics to consume.
        topics = ["garden_sensor_data"]
  • 32. mytelegraf.conf
      [[outputs.influxdb_v2]]
        ## The URLs of the InfluxDB cluster nodes.
        ##
        ## Multiple URLs can be specified for a single cluster, only ONE of the
        ## urls will be written to each interval.
        ## ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
        urls = ["https://us-east-1-1.aws.cloud2.influxdata.com/"]
        ## API token for authentication.
        token = "$INFLUX_TOKEN"
        ## Organization is the name of the organization you wish to write to; must exist.
        organization = "$INFLUX_ORG"
        ## Destination bucket to write into.
        bucket = "$INFLUX_BUCKET"
  • 33. garden_sensor_gateway.py ● Data Generation: Functions like random_temp_cels, random_humidity, random_wind, and random_soil generate random sensor data for temperature, humidity, wind speed, and soil moisture, respectively. These values are rounded to one decimal place. ● JSON Packaging: The get_json_data function aggregates these random sensor readings into a JSON-formatted string. The keys are the types of readings ("temperature", "humidity", etc.), and the values are the generated random numbers. ● Kafka Producer: In the main function, a KafkaProducer is initialized to send messages to a Kafka broker running on kafka:9092. The producer sends 20,000 messages to a topic called 'garden_sensor_data'. ● Message Sending Loop: On each iteration, the loop in main creates JSON-formatted sensor data with get_json_data, sends it to the Kafka topic, and prints a message to the console. The loop pauses for 5 seconds between iterations.
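The gateway described on this slide might look roughly like the sketch below. It is a reconstruction from the description, not the repo's actual script: the value ranges, the "wind" and "soil" key names, and the `publish` helper (which takes the send function as a parameter so the loop can be exercised without a broker) are all assumptions. The `main` function assumes the kafka-python package is installed and a broker is reachable at kafka:9092.

```python
import json
import random
import time

TOPIC = "garden_sensor_data"
BROKER = "kafka:9092"


# Value ranges below are illustrative assumptions, not taken from the repo.
def random_temp_cels():
    return round(random.uniform(-10.0, 50.0), 1)


def random_humidity():
    return round(random.uniform(0.0, 100.0), 1)


def random_wind():
    return round(random.uniform(0.0, 10.0), 1)


def random_soil():
    return round(random.uniform(0.0, 100.0), 1)


def get_json_data():
    """Package one set of readings as a JSON string."""
    return json.dumps({
        "temperature": random_temp_cels(),
        "humidity": random_humidity(),
        "wind": random_wind(),   # key name assumed
        "soil": random_soil(),   # key name assumed
    })


def publish(send, message_count=20000, pause_seconds=5.0):
    """Send message_count readings through send(topic, payload_bytes)."""
    for _ in range(message_count):
        payload = get_json_data()
        send(TOPIC, payload.encode("utf-8"))
        print(f"Sent to {TOPIC}: {payload}")
        time.sleep(pause_seconds)


def main():
    # Imported here so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=BROKER)
    publish(producer.send)
    producer.flush()  # make sure buffered messages are delivered before exit
```

Running `main()` inside the docker-compose network would start the 20,000-message loop; passing any other callable to `publish` lets you inspect the payloads locally.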
  • 35. Confluent + InfluxDB
  • 36. Confluent options V1. V1 Sink: at-least-once delivery, dead letter queue, multiple tasks. V1 Source: adding measurements from the database dynamically, whitelists and blacklists, varying polling intervals.
  • 37. Confluent options V2. At-least-once delivery: This connector guarantees that records from the Kafka topic are delivered at least once. Supports multiple tasks: The connector supports running one or more tasks. More tasks may improve performance.
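At-least-once delivery means a record can arrive more than once, for example after a retry. The sketch below illustrates why that is usually harmless for time series writes: if a store identifies a point by measurement, tag set, and timestamp, a redelivered point overwrites itself instead of creating a duplicate. The in-memory `store` dictionary, the `write_points` helper, and the sample batch are hypothetical stand-ins, not the connector's or InfluxDB's actual code.

```python
def write_points(store, batch):
    """Upsert points keyed by (measurement, sorted tag set, timestamp)."""
    for measurement, tags, timestamp, fields in batch:
        key = (measurement, tuple(sorted(tags.items())), timestamp)
        # A point with an existing key updates its fields in place.
        store.setdefault(key, {}).update(fields)


store = {}
batch = [
    ("garden", {"bed": "a"}, 1000, {"temperature": 21.5}),
    ("garden", {"bed": "a"}, 2000, {"temperature": 21.7}),
]

write_points(store, batch)
write_points(store, batch)  # simulated redelivery of the same records

print(len(store))
```

After the simulated redelivery the store still holds two points, one per distinct timestamp.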
  • 38. What about V3? Hopefully coming soon! Our OSS V3 is due at the end of this year, and we hope there will be a new Confluent V3 sink and source in Q1 of next year. For now, the V1 source and V2 sink work with V3.
  • 39. A blog post covering the InfluxDB + Confluent integration. It also covers the same integration with MQTT and compares the two streaming options for an IoT use case. Includes code examples.
  • 40. Company Examples
  • 41. Hulu reduced their tangled, Celtic-knot-like stream to a stable, fast, and durable stream with Kafka and InfluxDB, solving the problems that come with traditional publish-subscribe models. Hulu used Kafka and InfluxDB to scale to over 1M metrics per second.
  • 42. In Wayfair’s architecture, Kafka is sandwiched between Telegraf agents. A Telegraf agent with the Kafka output plugin pipes metrics from their application to Kafka, and a second Telegraf agent with the Kafka consumer input plugin collects those metrics from Kafka and sends them to InfluxDB. This model enables Wayfair to connect to multiple data centers, inject processing hooks ad hoc, and gain multi-day tolerance against severe outages.
  • 43. Grafana + InfluxDB
  • 44. Configure Grafana Data Source: select Data sources.
  • 45. Configure Grafana Data Source. TOP TIP! Make sure to remove the protocol from the URL and add the port. TOP TIP! Make sure your token has read permissions enabled!
  • 46. Basic Visualization. Query: SELECT time, fuel, "generatorID" FROM "genData" WHERE $__timeRange(time). Navigate to transform and partition values: Select Time Series. TOP TIP! $__timeRange(time) is a Grafana macro that makes your time ranges dynamic. There are others you can discover via the query helper.
  • 48. Try it yourself: https://github.com/InfluxCommunity/InfluxDB-IOx-Quick-Starts
  • 49. Resources
  • 50. Project QR codes: Kafka input plugin, Kafka output plugin, Kafka + Telegraf Project, Confluent Project.
  • 51. Try It Yourself: https://www.influxdata.com, https://github.com/InfluxCommunity
  • 52. Further Resources. Get started: influxdata.com/cloud. Slack: influxcommunity.slack.com. GitHub: github.com/InfluxCommunity. Docs: docs.influxdata.com. Blogs: influxdata.com/blog. InfluxDB University: influxdata.com/university.
  • 53. Thank you