Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Buesing | Current 2022

Don’t Forget about your past—
optimizing Apache Druid
performance with batch and real-time
Current 2022
Neil Buesing, Kinetic Edge
@nbuesing nbuesing
https://www.kineticedge.io

Goal
• Sleep as well as my dog, Katniss.

Goals
1. Technology Overview of Apache Druid and Apache Kafka
2. How to run Apache Druid and Apache Kafka locally
3. Druid Ingestion in real-time and batch
4. Query the data using Druid SQL Console
5. Configure Apache Druid Real-Time Ingestion to make it safe to reload historical segments
6. Real-Time and Batch Ingestion: working together

Apache Druid
1. Apache Druid still uses the term "master" (I want it to be renamed)
2. The overlord runs with the coordinator if druid.coordinator.asOverlord.enabled=true
3. Peons are processes; the incubating indexer effort uses threads instead
4. Metadata store: PostgreSQL or MySQL
Query: broker, router
Command: coordinator, overlord
Data: middlemanager, peon(s), historical
Dependencies: metadata store, zookeeper
Storage

Apache Druid
• File Format
• Segmentation
• Time
• Dimensions
• Metrics
Segment columns: __time | dimensions | metrics

Apache Druid
• Time
• Segment Granularity
• Query Granularity
[Diagram: timeline marked 2021-12-07T22:00:00Z and 2021-12-07T22:15:00Z. The event timestamp 2021-12-07T22:18:34.123Z is stored as 2021-12-07T22:18:00.000Z. Row timestamps shown: 22:05:00, 22:09:00, 22:13:00, 22:15:00, 22:18:00.]
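The truncation on this slide can be sketched in a few lines. This is only an illustration: it assumes a query granularity of one minute (which matches 22:18:34.123 becoming 22:18:00.000) and a segment granularity of one hour, which the slide does not state. The `truncate` helper is hypothetical and not part of Druid.

```python
from datetime import datetime, timezone


def truncate(ts: datetime, granularity: str) -> datetime:
    """Floor a timestamp to the start of its minute or hour bucket."""
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


# Event timestamp taken from the slide.
event = datetime(2021, 12, 7, 22, 18, 34, 123000, tzinfo=timezone.utc)

stored_time = truncate(event, "minute")   # query granularity (assumed: minute)
segment_start = truncate(event, "hour")   # segment granularity (assumed: hour)

print(stored_time.isoformat())
print(segment_start.isoformat())
```

Query granularity decides the timestamp stored in the row; segment granularity decides which time-chunked segment the row is written to.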

Apache Druid
• Why Query Granularity?
• Partially Precomputed Aggregates
Event 2021-12-07T22:18:34.123Z is stored as 2021-12-07T22:18:00.000Z
Example: two events at 22:18:34 and 22:18:49 for Store "Cloud 9", SKU "1234A", each with COUNT 1, with QTY 3 and 4, roll up into one row:
__time | Store | SKU | COUNT | QTY
22:18:00 | Cloud 9 | 1234A | 2 | 7
With real-time ingestion, precomputed aggregates are not absolute.
select sum("count") and select count("count") are not the same.
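A minimal sketch of the rollup on this slide, assuming a query granularity of one minute. Which event carries quantity 3 and which carries 4 is not recoverable from the slide, so the assignment below is arbitrary. The `rollup` function is hypothetical and only mimics the idea; it is not Druid's implementation.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events):
    """Group events by (minute, store, sku) and pre-aggregate count and quantity."""
    rows = defaultdict(lambda: {"count": 0, "quantity": 0})
    for ts, store, sku, qty in events:
        key = (ts.replace(second=0, microsecond=0), store, sku)
        rows[key]["count"] += 1
        rows[key]["quantity"] += qty
    return dict(rows)


events = [
    (datetime(2021, 12, 7, 22, 18, 34), "Cloud 9", "1234A", 3),
    (datetime(2021, 12, 7, 22, 18, 49), "Cloud 9", "1234A", 4),
]

# Both events rolled up together: one stored row.
stored = rollup(events)
logical = sum(row["count"] for row in stored.values())   # like sum("count")
physical = len(stored)                                    # like count("count")

# The same events rolled up separately: two partial rows for the same key.
partials = [rollup(events[:1]), rollup(events[1:])]
partial_logical = sum(r["count"] for p in partials for r in p.values())
partial_physical = sum(len(p) for p in partials)

print(logical, physical, partial_logical, partial_physical)
```

The second half shows why the aggregates are only partially precomputed: the logical count stays the same, but the number of stored rows depends on how the events happened to be rolled up, so queries should sum the stored counts rather than count rows.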

Apache Druid
[Diagram: ui and query reach the cluster through the router and broker. Data nodes: two historicals and two middlemanagers running tasks. Segments (__time | dimensions | metrics) are kept in Deep Storage. Also shown: coordinator, overlord, metadata store, zookeeper. Label: active real-time segments only.]

Apache Druid
select
  DATE_TRUNC('DAY', __time) "TIME",
  storeId,
  sku,
  sum("count") "CNT",
  sum(quantity) "QTY"
from skus
group by 1, 2, 3
order by 1 desc, 4 desc

Apache Druid - Aggregates
• Rollable
  • Count
  • Sum
  • Min
  • Max
  • Unique Counts (Approximations) - super cool!
  • First (String)
  • Last (String)
• Non Rollable
  • Mean
  • First (Numeric)
  • Last (Numeric)
String First and Last aggregators store the actual timestamp on rollup.
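The split between rollable and non-rollable aggregates can be illustrated with made-up numbers. In this sketch, `row_a` and `row_b` stand for the raw values behind two rolled-up rows; all names and values are hypothetical. The `merge_last` helper illustrates the note about String First and Last: keeping the timestamp next to the value is what lets two partial results be merged.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


row_a = [1.0, 2.0, 3.0, 4.0]   # raw values behind one rolled-up row (made up)
row_b = [10.0]                 # raw values behind another rolled-up row
raw = row_a + row_b

# Rollable: merging the partial results equals aggregating the raw values.
sum_merged = sum(row_a) + sum(row_b)
max_merged = max(max(row_a), max(row_b))

# Not rollable: the mean of the partial means is not the mean of the raw values.
mean_of_means = mean([mean(row_a), mean(row_b)])
true_mean = mean(raw)

# Storing sum and count (both rollable) keeps the mean recoverable at query time.
recovered_mean = (sum(row_a) + sum(row_b)) / (len(row_a) + len(row_b))


def merge_last(a, b):
    """Merge two (timestamp, value) pairs, keeping the pair with the later timestamp."""
    return a if a[0] >= b[0] else b


print(mean_of_means, true_mean, recovered_mean)
```

This is also why the earlier query divides a sum by a count instead of relying on a stored mean.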

Apache Druid - Unique Counts
Apache DataSketches : Theta : k=4
Uniform Random Hash (0.0 to 1.0)
Star Trek : 0.590
Quantum Leap : 0.698
Firefly : 0.465
X-Files : 0.335
Mandalorian : 0.825
Battlestar Galactica : 0.323
Stranger Things : 0.238
Estimate = k * (1 / theta)
4 * (1 / 0.465) = 8.6
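The arithmetic on this slide can be reproduced with a small sketch. It keeps the k smallest hash values and uses the slide's simplified formula k * (1 / theta), where theta is the k-th smallest hash. The `kmv_estimate` function is hypothetical; production sketch libraries use refined estimators and real hash functions, so treat this only as an illustration of the idea.

```python
def kmv_estimate(hashes, k):
    """Estimate a distinct count from the k smallest hash values (slide formula k / theta)."""
    smallest = sorted(set(hashes))[:k]
    if len(smallest) < k:
        # Fewer than k distinct hashes seen: the count is exact.
        return float(len(smallest))
    theta = smallest[-1]  # k-th smallest hash value
    return k / theta


# Hash values taken from the slide.
hashes = {
    "Star Trek": 0.590,
    "Quantum Leap": 0.698,
    "Firefly": 0.465,
    "X-Files": 0.335,
    "Mandalorian": 0.825,
    "Battlestar Galactica": 0.323,
    "Stranger Things": 0.238,
}

estimate = kmv_estimate(hashes.values(), k=4)
print(round(estimate, 1))
```

With k=4 the four smallest hashes are 0.238, 0.323, 0.335 and 0.465, so theta is 0.465 and the estimate is about 8.6 for 7 actual items. A larger k tightens the estimate at the cost of more stored hashes.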

My All-Time Favorite Druid Query
Apache Druid - Rollup Factor
select sum("count") "Logical Count",
count("count") "Physical Count",
sum("count")/(count("count")*1.0) "Rollup Factor"
from datasource
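The same ratio can be worked by hand. In this sketch, `stored_counts` stands for the "count" column of the stored rows; the values are made up and `rollup_factor` is a hypothetical helper.

```python
def rollup_factor(stored_counts):
    """Logical row count divided by physical row count."""
    if not stored_counts:
        raise ValueError("no stored rows")
    logical = sum(stored_counts)    # sum("count"): events ingested
    physical = len(stored_counts)   # count("count"): rows actually stored
    return logical / physical


stored_counts = [2, 1, 5]  # hypothetical "count" values of three stored rows
print(rollup_factor(stored_counts))
```

A factor of 1.0 means rollup is not combining any rows; higher values mean more events are folded into each stored row.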

Apache Kafka
[Diagram: Kafka Connect (source), Kafka Connect (sink), Schema Registry, Producer Application, Apache Druid (Consumer), Kafka Streams Application, ksqlDB]
Labels: partitioner = murmur2_random; consumer B’

Apache Kafka & Druid
[Diagram: Apache Kafka with three brokers; partitions a:0, a:1, a:2 each appear three times across the brokers. On the Apache Druid Middle Manager, the druid supervisor uses assign() to give partitions to task-0, task-1, task-2. Also shown: metadata store.]
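A rough sketch of how partitions might be spread over tasks. The modulo rule used here is an assumption for illustration; the slide only shows that the supervisor hands partitions a:0, a:1, a:2 to task-0, task-1, task-2 with assign(). The `assign_partitions` function is hypothetical.

```python
def assign_partitions(partitions, task_count):
    """Spread partition ids over tasks by partition modulo task count (assumed rule)."""
    groups = {f"task-{i}": [] for i in range(task_count)}
    for partition in partitions:
        groups[f"task-{partition % task_count}"].append(partition)
    return groups


# Topic "a" with partitions 0, 1, 2, as on the slide.
print(assign_partitions(range(3), task_count=3))
print(assign_partitions(range(3), task_count=2))
```

Because the partitions are handed out explicitly rather than through a consumer group rebalance, each task reads a fixed set of partitions for its lifetime.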

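[Diagram: Druid Middle Manager with the druid supervisor, task-0 and task-1. Events at 22:01, 22:59, 23:10, 23:11, 23:55 and 24:55 go to hourly segments 22:00:00Z, 23:00:00Z and 24:00:00Z; segments 22:00:00Z and 23:00:00Z also appear in Deep Storage. An event at 08:33 goes to segment 08:00:00Z.]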

[Diagram: Druid Middle Manager, task-1. Events at 01:xx through 10:xx each go to their own hourly segment, 01:00:00Z through 10:00:00Z, all held by one task.]

[Diagram: Druid Middle Manager, druid supervisor, task-1 holding segments 01:00:00Z through 10:00:00Z. Label: avoid]

• Fragmented Segments
• storage costs
• query performance
• compaction cost
• Open File Handles
• middle manager resources
[Diagram: one task holding segments 01:00:00Z through 10:00:00Z]

Development
A Local Environment

Kafka Local
• https://github.com/kineticedge/dev-local
• kafka
• druid
• kafka-connect
• ksqlDB
• mongo
• grafana/prometheus dashboards
• mysql
• superset
• and more
Apple Silicon?
• CP images (7.2.0+) support arm64/v8
• Druid: you need to build your own arm64/v8 images
docker inspect image:version --format "{{.Architecture}}"

Kafka Local Demos
• https://github.com/kineticedge/dev-local-demos
• Uses dev-local Container Based Environment
• demos with up/setup/down scripts for easy execution
• druid-late (Today's Demo)
• key-mismatch
• rdbms-cdc-nosql
• mongo-cdc
• … and more to come …

Kafka Local / DEMO
cd dev-local-demos/druid-late
.
  README.md
  up.sh
  setup.sh
  druid.sh
  connect.sh
  producer/run.sh
Components:
• Apache Kafka
• Apache Druid
• Kafka Connect / S3 Sink
• Minio
• Java Producer - Fake Data

Kafka Local / DEMO
SELECT
(case is_realtime when 1 then 'REALTIME' else 'HISTORICAL' end) "TYPE",
count(*) "COUNT"
FROM sys.segments
GROUP BY 1
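Besides the SQL console, the same query can be sent to Druid's SQL HTTP endpoint (POST to /druid/v2/sql with a JSON body holding the query). The sketch below only builds the request. The base URL http://localhost:8888 is an assumption about where the router is exposed locally, and `build_sql_request` is a hypothetical helper.

```python
import json
from urllib.request import Request

SEGMENT_TYPES_SQL = """
SELECT
  (case is_realtime when 1 then 'REALTIME' else 'HISTORICAL' end) "TYPE",
  count(*) "COUNT"
FROM sys.segments
GROUP BY 1
"""


def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build a POST request for the Druid SQL endpoint (base_url is an assumption)."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(SEGMENT_TYPES_SQL)
print(request.get_method(), request.full_url)
# Against a running cluster: urllib.request.urlopen(request).read()
```

Running this while the producer is active should show the REALTIME count change as segments are handed off and become HISTORICAL.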

Apache Druid
Real-Time and Batch

Apache Druid - Real-Time
[Diagram: a query goes to the broker, which reads from the historical and from a middlemanager task. Segments (__time | dimensions | metrics) are kept in Deep Storage. Segment labels: real-time, batch, real-time (handed-off).]

Apache Druid
• Reject messages with timestamps earlier than this period before the task was created
• lateMessageRejectionPeriod
• e.g. PT1H
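The rule above comes down to one comparison. In this sketch the task creation time is made up, PT1H is represented as a `timedelta`, and `is_rejected` is a hypothetical helper rather than Druid code.

```python
from datetime import datetime, timedelta, timezone


def is_rejected(event_time, task_created, rejection_period):
    """True when the event is earlier than task_created minus the rejection period."""
    return event_time < task_created - rejection_period


task_created = datetime(2022, 9, 15, 12, 0, tzinfo=timezone.utc)  # hypothetical
period = timedelta(hours=1)                                        # PT1H

late = datetime(2022, 9, 15, 10, 59, tzinfo=timezone.utc)
on_time = datetime(2022, 9, 15, 11, 30, tzinfo=timezone.utc)

print(is_rejected(late, task_created, period))      # older than the cutoff
print(is_rejected(on_time, task_created, period))   # within the window
```

Anything the real-time tasks reject this way never touches older segments, which is what leaves those intervals free for a batch job to append to or reload.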

Apache Druid - Real-Time & Batch
[Diagram: a query goes to the broker, which reads from the historical and from middlemanager tasks. Segments (__time | dimensions | metrics) are kept in Deep Storage. Real-time task: lateMessageRejectionPeriod PT1H. Batch task: Append or Reload. Segment labels: real-time, batch, real-time (handed-off).]

Apache Druid - Batch Task
...
"prefixes": [
  "s3://sku/topics/skus/y=2022/m=09/"
],
...
"intervals": [
  "2022-09-01T00:00:00/2022-10-01T00:00:00"
]
...
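The prefix and the interval in a batch task have to describe the same month. A small sketch that derives both from a year and month, assuming the bucket and path layout shown on the slide (s3://sku/topics/skus/y=YYYY/m=MM/). The `month_reload_spec` helper is hypothetical, and the returned dict only pairs the two values; in a full ingestion spec these fields sit in different sections.

```python
from datetime import date


def month_reload_spec(year, month, bucket="sku", topic="skus"):
    """Return the S3 prefix and the interval covering one calendar month."""
    start = date(year, month, 1)
    # First day of the following month, rolling over the year in December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    prefix = f"s3://{bucket}/topics/{topic}/y={year}/m={month:02d}/"
    interval = f"{start.isoformat()}T00:00:00/{end.isoformat()}T00:00:00"
    return {"prefixes": [prefix], "intervals": [interval]}


print(month_reload_spec(2022, 9))
```

Generating both values from one input avoids reloading an interval with files from a different month.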

https://github.com/kineticedge/dev-local-demos
Demonstration
@nbuesing nbuesing

Questions
@nbuesing nbuesing
https://github.com/kineticedge
dev-local - container ecosystem
dev-local-demos - demonstrations
druid-m1 - build arm64/v8 image for your Apple Silicon
… & more …
https://www.kineticedge.io
