A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid

©2022, Imply
The Trifecta of Real-Time Applications:
Apache Kafka, Flink, and Druid

Speakers
● Gian Merlino: Co-founder and CTO of Imply, PMC Chair of Apache Druid
● Kai Wähner: Field CTO of Confluent

Agenda
1. Overview of real-time applications
2. Kafka-Flink-Druid data architecture
3. Introduction to Apache Flink + use cases
4. Introduction to Apache Druid + use cases
5. Kafka-Flink-Druid case studies
6. Confluent and Imply integration
7. Q&A

Data applications are going real-time

REAL-TIME APPLICATION
● >5 million events per second
● Interactive queries
● Customer-facing
● 250 queries per second
● Complex analytics
● Operational visibility

● Real-time monitoring
● Real-time alerting
● Real-time decisioning
● Interactive queries
● High dimensionality

● Real-time decisioning
● API-driven interactive queries
● Customer-facing
● 25 million users
● <100 ms query latency
● 100+ QPS

Building real-time apps with batch workflows doesn’t work
[Diagram: five sequential stages, with a "Waiting…" gap around each]
1. Data Collection
2. Data Processing
3. Data Ingestion
4. Data Analysis
5. Data Presentation
Latency measured in hours-to-days

Open source real-time data architecture
[Diagram: Data Sources → Events/Messages → Streaming Pipeline → Stream Processing (Enrichment / Transformation) → Real-Time Analytics (+ Historical Data / Context) → Real-Time Applications (Monitor/Alerting, Analytics, Visualization/Decisioning)]

A common real-time KFD pattern

Source:
timestamp            sensor_id  temperature
2023-07-10T10:00:00  SensorA    22.5
2023-07-10T10:01:00  SensorB    18.2
2023-07-10T10:02:00  SensorC    65.5

timestamp            sensor_id  location
2023-07-10T10:00:00  SensorA    Room 101
2023-07-10T10:01:00  SensorB    Room 101
2023-07-10T10:02:00  SensorC    Room 101

timestamp            sensor_id  location  temperature
2023-07-10T10:00:00  SensorA    Room 101  22.5
2023-07-10T10:01:00  SensorB    Room 101  18.2
2023-07-10T10:02:00  SensorC    Room 101  65.5

Current: 65.5
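
The pattern in this slide (join a stream of readings with reference data, then query the latest value) can be sketched in plain Python. This is only an illustration of the data flow, not Flink or Druid code: the `enrich` function, the `EnrichedReading` type, and the `ALERT_THRESHOLD` value are assumptions introduced for the example, while the rows are copied from the slide.

```python
from dataclasses import dataclass

# Rows copied from the slide's example tables.
READINGS = [
    ("2023-07-10T10:00:00", "SensorA", 22.5),
    ("2023-07-10T10:01:00", "SensorB", 18.2),
    ("2023-07-10T10:02:00", "SensorC", 65.5),
]
LOCATIONS = {"SensorA": "Room 101", "SensorB": "Room 101", "SensorC": "Room 101"}

ALERT_THRESHOLD = 60.0  # hypothetical threshold, not taken from the slides


@dataclass
class EnrichedReading:
    timestamp: str
    sensor_id: str
    location: str
    temperature: float


def enrich(readings, locations):
    """Join each reading with its sensor's location (the stream-processing step)."""
    for timestamp, sensor_id, temperature in readings:
        location = locations.get(sensor_id, "unknown")
        yield EnrichedReading(timestamp, sensor_id, location, temperature)


enriched = list(enrich(READINGS, LOCATIONS))

# The analytics step: report the most recent temperature and any hot sensors.
current = enriched[-1].temperature
hot = [r.sensor_id for r in enriched if r.temperature > ALERT_THRESHOLD]
```

In the architecture described in the deck, the join would run in the stream processor and the "current value" lookup would be a query against the analytics database; here both are collapsed into one script for readability.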

Introduction to Apache Flink

Two Apache Projects, Born a Few Years Apart
Flink growth has mirrored the growth of Kafka, the de facto standard for streaming data
● >75% of the Fortune 500 estimated to be using Kafka
● >100,000 orgs using Kafka
● >41,000 Kafka meetup attendees
● >750 Kafka Improvement Proposals
● >12,000 Jiras for Apache Kafka
[Chart: Monthly Unique Users, axis 0 to 150,000; Flink 2020-2022 vs. Kafka 2016-2018]

Developers choose Flink because of its performance and rich feature set
● Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale
● Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability
● Language Flexibility: Flink supports Java, Python, & SQL, enabling developers to work in their language of choice
● Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology
Flink is a top 5 Apache project and boasts a robust developer community

Flink supports unified stream and batch processing
Streaming:
● Entire pipeline must always be running
● Input must be processed as it arrives
● Results are reported as they become ready
● Failure recovery resumes from a recent snapshot
● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc.
Batch:
● Execution proceeds in stages, running as needed
● Input may be pre-sorted by time and key
● Results are reported at the end of the job
● Failure recovery does a reset and full restart
● Effectively exactly-once guarantees are more straightforward
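
The contrast between the two execution modes can be illustrated with a small plain-Python sketch. It does not use Flink's API; `batch_sum`, `streaming_sum`, and the sample events are assumptions made up for this example. The batch version sorts its bounded input and reports once at the end, while the streaming version processes events in arrival order and emits an updated result after each one. Both converge on the same totals.

```python
from collections import defaultdict

# (event_time, key, value) tuples, deliberately listed out of time order.
EVENTS = [
    (3, "a", 5),
    (1, "a", 2),
    (2, "b", 7),
    (4, "b", 1),
]


def batch_sum(events):
    """Bounded input: pre-sort by time and key, report results once at the end."""
    totals = defaultdict(int)
    for _, key, value in sorted(events):
        totals[key] += value
    return dict(totals)


def streaming_sum(events):
    """Unbounded input: handle each event as it arrives and emit an update."""
    totals = defaultdict(int)
    for _, key, value in events:
        totals[key] += value
        yield key, totals[key]


updates = list(streaming_sum(EVENTS))
final_streaming = dict(updates)  # the last update per key wins
final_batch = batch_sum(EVENTS)
```

A sum is insensitive to arrival order, which keeps the sketch short; windowed or order-dependent logic would need event-time handling that is omitted here.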

Effortlessly filter, join, and enrich your data streams with Apache Flink
● Real-time processing: Power low-latency applications and pipelines that react to real-time events and provide timely insights
● Data reusability: Share consistent and reusable data streams widely with downstream applications and systems
● Data enrichment: Curate, filter, and augment data on-the-fly with additional context to improve completeness, accuracy, & compliance
● Efficiency: Improve resource utilization and cost-effectiveness by avoiding redundant processing across silos
“With Confluent’s fully managed Flink offering, we can access, aggregate, and enrich data from IoT sensors,
smart cameras, and Wi-Fi analytics, to swiftly take action on potential threats in real time, such as intrusion
detection. This enables us to process sensor data as soon as the events occur, allowing for faster detection and
response to security incidents without any added operational burden.”

Process data streams in-flight to maximize actionability, fidelity, and portability
[Diagram labels: Blob storage, 3rd party app, Databases, Data Warehouse, Database, SaaS app]
● Low latency apps and data pipelines
● Consistent, reusable data products
● Optimized resource utilization

Enrich real-time data streams with Generative AI directly from Flink SQL

INSERT INTO enriched_reviews
SELECT id,
       review,
       invoke_openai(prompt, review) AS score
FROM product_reviews;

The Prompt
“Score the following text on a scale of 1 and 5 where 1 is negative and 5 is positive returning only the number”

[Diagram: product reviews pass through the DATA STREAMING PLATFORM and come out with a star score attached]
● Kate, 4 hours ago: This was the worst decision ever.
● Nikola, 1 day ago: Not bad. Could have been cheaper.
● Brian, 3 days ago: Amazing! Game Changer!

Introduction to Apache Druid

Druid is built for the intersection of analytics and applications.
[Venn diagram: Analytics and Applications, with Apache Druid at the intersection]

Apache Druid is a real-time analytics database
For analytics applications that require:
1. Sub-second queries at massive scale: Interactive analytics on TB-PBs of data
2. High concurrency at low cost: 1000s QPS via highly efficient distributed query engine
3. Real-time and historical insights: True stream ingestion for Kafka and Kinesis
Plus, non-stop reliability with automated fault tolerance and continuous backup
(Example of a Druid-powered UX)

Druid’s stream ingestion scales right with Kafka
[Diagram: Producers → Kafka Topic (Partition 0, Partition 1, Partition 2, … Partition N) → Real-Time Ingestion (Ingestion Task 0, Ingestion Task 1, … Ingestion Task M) → Historicals; Query]
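
One way to picture how ingestion scales with the topic is to spread N partitions over M ingestion tasks. The sketch below uses a simple modulo rule; that rule, the function name `assign_partitions`, and the partition and task counts are assumptions for illustration and are not presented as Druid's exact scheduling logic.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, task_count: int) -> Dict[int, List[int]]:
    """Spread partitions across ingestion tasks using partition % task_count."""
    if task_count < 1:
        raise ValueError("task_count must be at least 1")
    assignment: Dict[int, List[int]] = {task: [] for task in range(task_count)}
    for partition in range(num_partitions):
        assignment[partition % task_count].append(partition)
    return assignment


# Eight partitions read by four tasks, then by two tasks.
four_tasks = assign_partitions(8, 4)
two_tasks = assign_partitions(8, 2)
```

With more tasks, each task reads fewer partitions, which is the sense in which ingestion parallelism follows the partition count of the topic.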

Apache Druid is a database built for data in motion
● Event-based ingestion
● High EPS scalability
● Schema auto-detection
● No connector needed to Druid
● Real-time insights with flexible historical queries
● Guaranteed consistency
● Continuous backup

A high-performance, real-time analytics database
● Operational visibility at scale: For use cases that require real-time insights on big, fast-moving event streams like observability, product analytics, clickstream, IoT, and fraud detection
● Rapid data exploration: For use cases that require instant response on interactive, ad-hoc queries at scale on high-dimensional data such as root cause diagnostics, ML training, and investigation
● External-facing analytics: For use cases where analytics are being delivered to external stakeholders as a product or as a value add with strict SLAs for performance under load and resiliency
● Real-time decisioning: For use cases where instant query response powers automated rules engines and ML frameworks, including real-time decisions and recommendations
Industries: Supply chain, Logistics, Healthtech, Adtech, Fintech, Gaming, Entertainment, Retail, eCommerce

Just a few examples of the 1000s of Druid users

Selecting the right tool for the job
Apache Druid complements your analytics stack

Apache Druid (Real-Time Analytics Database)
Ideal workload:
● High QPS application/API usage
● Strict latency requirements
● Less elastic
Technical properties:
● Scatter-gather and shuffling engines
● Leverage broad array of indexes
● Guaranteed data caching

Snowflake/BigQuery/Trino (Cloud Data Warehouses + Query Engines)
Ideal workload:
● Ad-hoc, low concurrency usage
● Loose latency requirements
● Highly elastic
Technical properties:
● Shuffling engines
● Focus on sequential data access
● Opportunistic data caching

Trusted technology with an awesome community
● 1,900+ Companies using Druid
● 150% YoY Increase in Community Activity
● 14,000+ Community Members
● 600+ Active Contributors

The data teams at leading organizations choose Druid
[Logo grid by category: Advertising, Entertainment, Industrial, Financial, Gaming, Technology/Platform, Technology/SaaS, Security]
And many more!

Reddit Ad Serving Pipeline
[Diagram labels: Ad Serving, Event Collector, METADATA, USER ACTIONS]
● 10+ GB/hr
● 3X faster queries
● 99.9% availability

Lyft Analytics Pipeline
[Diagram labels: Phone, Service, Amazon Kinesis, JSON events, Batch processing, Viz tools, User]
● 400+ queries/minute
● 10X faster queries
● <500ms latency on 99% of queries

Summary: Real-time use cases for Kafka-Flink-Druid
On high throughput Kafka streams:
● ALERTING: Stateful or stateless triggered actions
● MONITORING: Continuous tracking of KPIs
● DASHBOARDS: User-facing operational visibility
● EXPLORATION: Ad-hoc rapid data exploration
● DECISIONING: API-triggered automated workflows (if X, then Y)
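
The difference between stateless and stateful triggered actions can be shown with a short plain-Python sketch. Nothing here comes from the deck beyond the idea itself: the threshold, the readings, the `stateless_alert` function, and the `ConsecutiveBreachAlert` class are assumptions made up for illustration. The stateless rule looks at one event at a time ("if X, then Y"), while the stateful rule remembers earlier events and fires only after consecutive breaches.

```python
def stateless_alert(temperature, threshold=60.0):
    """Stateless rule: decide from the current event alone."""
    return temperature > threshold


class ConsecutiveBreachAlert:
    """Stateful rule: fire only after `required` consecutive readings above threshold."""

    def __init__(self, threshold=60.0, required=2):
        self.threshold = threshold
        self.required = required
        self.streak = 0  # state carried between events

    def observe(self, temperature):
        if temperature > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required


readings = [22.5, 65.5, 18.2, 61.0, 70.3]

stateless = [stateless_alert(t) for t in readings]

rule = ConsecutiveBreachAlert()
stateful = [rule.observe(t) for t in readings]
```

The stateless rule fires on the isolated spike at 65.5, whereas the stateful rule waits until two readings in a row exceed the threshold.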

What makes Kafka-Flink-Druid so popular?
● Stream-Native: All 3 are designed natively for streaming data, supporting exactly-once semantics and event-by-event processing.
● Massively Scalable: All 3 can handle massive event throughput into the millions of events per second across delivery, processing, and analytics.
● Fault Tolerant: All 3 work in tandem for mission-critical use cases with guaranteed consistency, data and node replication, and data durability.
● Complementary: All 3 provide distinct capabilities that together serve the full breadth of real-time application use cases.

"When used in combination, Apache Flink & Apache Kafka can enable data reusability and avoid redundant
downstream processing. The delivery of Flink & Kafka as fully managed services delivers stream processing
without the complexities of infrastructure management, enabling teams to focus on building real-time streaming
applications & pipelines that differentiate the business."
Confluent Cloud: Unified platform for Kafka and Flink seamlessly integrated
● Enterprise-grade security: Secure stream processing with built-in identity and access management, RBAC, and audit logs
● Stream governance: Enforce data policies and avoid metadata duplication leveraging native integration with Stream Governance
● Monitoring: Ensure the health and uptime of your Flink queries in the Confluent UI or via 3rd party monitoring services
● Connectors

Imply Polaris
The Cloud Database Service for Apache Druid
● Most Affordable
● Most Secure
● Best Time to Value
And for OS Druid Users
