Don’t Forget About Your Past:
Optimizing Apache Druid
Performance with Batch and Real-Time
Current 2022
Neil Buesing, Kinetic Edge
@nbuesing nbuesing
https://www.kineticedge.io
Goal
• Sleep as well as my dog, Katniss.
Goals
1. Technology Overview of Apache Druid and Apache Kafka
2. How to run Apache Druid and Apache Kafka locally
3. Druid Ingestion in real-time and batch
4. Query the data using Druid SQL Console
5. Configure Apache Druid Real-Time Ingestion to make it safe to
reload historical segments
6. Real-Time and Batch Ingestion: working together
Apache Druid
Overview
Apache Druid
1. Apache Druid still uses the term master (I want it to be renamed)
2. the overlord runs with the coordinator if druid.coordinator.asOverlord.enabled=true
3. peons are processes; the incubating indexer effort uses threads instead
4. metadata store: PostgreSQL or MySQL
Query
broker
router
Command
coordinator
overlord
Data
middlemanager
historical
Dependencies
metadata store
zookeeper
peon(s)
Storage
• File Format
• Segmentation
• Time
• Dimensions
• Metrics
__time dimensions metrics
Apache Druid
Apache Druid
• Time
• Segment Granularity
• Query Granularity
__time dimensions metrics
__time dimensions metrics
2021-12-07T22:00:00Z
2021-12-07T22:15:00Z
2021-12-07T22:18:34.123Z
2021-12-07T22:18:00.000Z
22:15:00
22:18:00
22:15:00
22:18:00
22:09:00
22:09:00
22:05:00
22:13:00
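A minimal sketch of the two granularities, using the timestamps shown on the slide. It assumes hourly segment granularity and minute query granularity (the slide does not name them explicitly), and the helper names are hypothetical. Segment granularity decides which segment an event lands in; query granularity decides the __time value stored in the row.

```python
from datetime import datetime, timezone


def truncate_to_hour(ts):
    """Segment granularity (assumed HOUR): start of the segment interval."""
    return ts.replace(minute=0, second=0, microsecond=0)


def truncate_to_minute(ts):
    """Query granularity (assumed MINUTE): the __time value stored in the row."""
    return ts.replace(second=0, microsecond=0)


# Event timestamp taken from the slide.
event_time = datetime(2021, 12, 7, 22, 18, 34, 123000, tzinfo=timezone.utc)

segment_start = truncate_to_hour(event_time)
stored_time = truncate_to_minute(event_time)

print("segment starts at:", segment_start.isoformat())
print("stored __time:    ", stored_time.isoformat())
```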
Apache Druid
• Why Query Granularity?
• Partially Precomputed Aggregates
__time dimensions metrics
2021-12-07T22:18:34.123Z
2021-12-07T22:18:00.000Z

__time     Store     SKU     COUNT   QTY
22:18:34   Cloud 9   1234A   1       4
22:18:49   Cloud 9   1234A   1       3
22:18:00   Cloud 9   1234A   2       7    (rolled up)
With real-time ingestion, precomputed
aggregates are not absolute.
sum("count") and count("count") are not the same.
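The rollup on this slide can be sketched in a few lines. This is an illustration only, not how Druid implements it: it assumes minute query granularity, and the rollup function and the tuple layout are hypothetical. The two Cloud 9 events from the slide collapse into one stored row with COUNT 2 and QTY 7.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events):
    """Group events by (minute, store, sku); pre-aggregate count and quantity."""
    rows = defaultdict(lambda: {"count": 0, "quantity": 0})
    for ts, store, sku, qty in events:
        # Truncate the event time to the (assumed) minute query granularity.
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        key = (minute.isoformat(), store, sku)
        rows[key]["count"] += 1
        rows[key]["quantity"] += qty
    return dict(rows)


# The two events shown on the slide (date taken from the slide's timestamp).
events = [
    ("2021-12-07T22:18:34", "Cloud 9", "1234A", 4),
    ("2021-12-07T22:18:49", "Cloud 9", "1234A", 3),
]

rolled = rollup(events)
for key, metrics in rolled.items():
    print(key, metrics)
```

If two real-time tasks each held one of these events, each would store its own partial row. That is why the number of stored rows and the sum of the stored counts can differ.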
Apache Druid
middlemanager
Deep Storage
__time dimensions metrics __time dimensions metrics __time dimensions metrics __time dimensions metrics
__time dimensions metrics
historical
historical
middlemanager
task
task
task
task
broker
query
router
ui coordinator
overlord
metadata store
zookeeper
active
real-time segments only
Apache Druid
select
  DATE_TRUNC('DAY', __time) "TIME",
  storeId,
  sku,
  sum("count") "CNT",
  sum(quantity) "QTY"
from skus
group by 1, 2, 3
order by 1 desc, 4 desc
Apache Druid - Aggregates
• Rollable
  • Count
  • Sum
  • Min
  • Max
  • Unique Counts (Approximations) - super cool!
  • First (String)
  • Last (String)
• Non Rollable
  • Mean
  • First (Numeric)
  • Last (Numeric)
String First & Last aggregates on rollup
store the actual timestamp.
Apache Druid - Unique Counts
Apache Data Sketches : Theta : k=4
0.0 1.0
Star Trek : 0.590
Quantum Leap : 0.698
Firefly : 0.465
X-Files : 0.335
Mandalorian : 0.825
Battlestar Galactica : 0.323
4 * (1 / 0.465) = 8.6
k * (1 / theta)
Uniform Random Hash
Stranger Things : 0.238
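The arithmetic on this slide can be reproduced with a simplified k-minimum-values sketch. This is only an illustration of the slide's formula, k * (1 / theta); it is not the Apache DataSketches implementation, and the function name is hypothetical. The hash values are the ones shown on the slide.

```python
def kmv_estimate(hashes, k):
    """Keep the k smallest hash values; theta is the largest value kept."""
    smallest = sorted(hashes)[:k]
    theta = smallest[-1]
    # Slide formula: k * (1 / theta).
    return k * (1.0 / theta), theta


# Uniform random hash values in [0.0, 1.0), as shown on the slide.
hashes = {
    "Star Trek": 0.590,
    "Quantum Leap": 0.698,
    "Firefly": 0.465,
    "X-Files": 0.335,
    "Mandalorian": 0.825,
    "Battlestar Galactica": 0.323,
    "Stranger Things": 0.238,
}

estimate, theta = kmv_estimate(hashes.values(), k=4)
print(f"theta={theta}, estimate={estimate:.1f}, actual={len(hashes)}")
```

With k=4 the estimate is rough (about 8.6 for 7 distinct values); a larger k keeps more hashes and tightens the estimate.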
My All-Time Favorite Druid Query
Apache Druid - Rollup Factor
select sum("count") "Logical Count",
count("count") "Physical Count",
sum("count")/(count("count")*1.0) "Rollup Factor"
from datasource
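The same calculation outside of SQL, as a small sketch. The stored "count" values below are hypothetical; each list entry stands for one physical row and holds the number of events rolled into it.

```python
def rollup_factor(stored_counts):
    """Return (logical count, physical count, rollup factor)."""
    physical = len(stored_counts)
    if physical == 0:
        raise ValueError("no rows stored")
    logical = sum(stored_counts)
    return logical, physical, logical / physical


# Hypothetical "count" column: four stored rows representing twelve events.
logical, physical, factor = rollup_factor([2, 1, 5, 4])
print(logical, physical, factor)
```

A factor near 1.0 means rollup is saving almost nothing; a higher factor means each stored row stands in for more ingested events.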
Apache Kafka
Overview
Apache Kafka
Kafka Raft
partitioner = murmur2_random
consumer B’
Apache Kafka
Kafka
Connect
(source)
Kafka
Connect
(sink)
Schema Registry
Producer
Application
Apache
Druid
(Consumer)
Kafka Streams
Application
ksqlDB
Apache Druid & Kafka
Overview
Apache
Druid Middle Manager
Apache Kafka & Druid
Apache Kafka
broker
broker
task-0
broker
a:0
a:1
a:2
a:0
a:0
a:1
a:1
a:2
a:2
druid supervisor
__time dimensions metrics
__time dimensions metrics
__time dimensions metrics
task-1
task-2
assign()
metadata store
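A sketch of how the partitions of topic "a" could be spread across ingestion tasks. The modulo rule is an assumption made for illustration (the slide only shows the supervisor calling assign() with one partition per task), and assign_partitions is a hypothetical helper, not a Druid or Kafka API.

```python
def assign_partitions(partitions, task_count):
    """Spread partitions over tasks with partition % task_count (assumed rule)."""
    assignment = {task: [] for task in range(task_count)}
    for partition in partitions:
        assignment[partition % task_count].append(partition)
    return assignment


# Topic "a" has partitions 0, 1 and 2, as on the slide.
three_tasks = assign_partitions([0, 1, 2], task_count=3)
two_tasks = assign_partitions([0, 1, 2], task_count=2)

print(three_tasks)
print(two_tasks)
```

Because the tasks are handed their partitions explicitly, rather than subscribing as a consumer group, the supervisor stays in control of which task reads which partition.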
Druid Middle
Manager
Deep Storage
__time dimensions metrics
23:00:00Z
__time dimensions metrics
23:00:00Z
__time dimensions metrics
22:00:00Z
__time dimensions metrics
22:00:00Z
Apache Kafka & Druid
druid supervisor
__time dimensions metrics
23:00:00Z
__time dimensions metrics
24:00:00Z
23:10
23:11
22:59
22:01
23:55
24:55
task-0
task-1
task-0
08:33
task-1
__time dimensions metrics
08:00:00Z
Druid Middle
Manager
Apache Kafka & Druid
druid
task-1
01:xx
__time dimensions metrics
01:00:00Z
02:xx
__time dimensions metrics
02:00:00Z
03:xx
__time dimensions metrics
03:00:00Z
04:xx
__time dimensions metrics
04:00:00Z
05:xx
__time dimensions metrics
05:00:00Z
__time dimensions metrics
06:00:00Z
06:xx
__time dimensions metrics
07:00:00Z
07:xx
__time dimensions metrics
08:00:00Z
08:xx
__time dimensions metrics
09:00:00Z
09:xx
__time dimensions metrics
10:00:00Z
10:xx
task
Druid Middle
Manager
Apache Kafka & Druid
druid supervisor
task-1
__time dimensions metrics
01:00:00Z
__time dimensions metrics
02:00:00Z
__time dimensions metrics
03:00:00Z
__time dimensions metrics
04:00:00Z
__time dimensions metrics
05:00:00Z
__time dimensions metrics
06:00:00Z
__time dimensions metrics
07:00:00Z
__time dimensions metrics
08:00:00Z
__time dimensions metrics
09:00:00Z
__time dimensions metrics
10:00:00Z
task
avoid
• Fragmented Segments
• storage costs
• query performance
• compaction cost
• Open File Handles
• middle manager resources
Apache Kafka & Druid
__time dimensions metrics
01:00:00Z
__time dimensions metrics
02:00:00Z
__time dimensions metrics
03:00:00Z
__time dimensions metrics
04:00:00Z
__time dimensions metrics
05:00:00Z
__time dimensions metrics
06:00:00Z
__time dimensions metrics
07:00:00Z
__time dimensions metrics
08:00:00Z
__time dimensions metrics
09:00:00Z
__time dimensions metrics
10:00:00Z
task
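A small sketch of why late data fragments segments. It assumes hourly segment granularity, and the event times and the hourly_segments helper are hypothetical. One task that receives an event for each hour from 01:xx to 10:xx has to keep ten segments open; a task that only sees the current hour keeps one.

```python
from datetime import datetime


def hourly_segments(event_times):
    """Return the distinct hourly segment start times the events fall into."""
    starts = {
        datetime.fromisoformat(t).replace(minute=0, second=0, microsecond=0)
        for t in event_times
    }
    return sorted(starts)


# Hypothetical event times: one late event for each hour from 01:xx to 10:xx.
scattered = [f"2022-09-15T{hour:02d}:30:00" for hour in range(1, 11)]
# Hypothetical event times that all fall in the current hour.
current = ["2022-09-15T10:05:00", "2022-09-15T10:30:00", "2022-09-15T10:59:00"]

print(len(hourly_segments(scattered)), "segments open")
print(len(hourly_segments(current)), "segment open")
```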
Apache Superset
Overview
Apache Superset
Apache Superset
Development
A Local Environment
Kafka Local
• https://github.com/kineticedge/dev-local
• kafka
• druid
• kafka-connect
• ksqlDB
• mongo
• grafana/prometheus dashboards
• mysql
• superset
• and more
CP Images (7.2.0+) support arm64/v8 images
for Druid, you need to build your own arm64/v8 images
docker inspect image:version --format "{{.Architecture}}"
Apple Silicon?
Kafka Local Demos
• https://github.com/kineticedge/dev-local-demos
• Uses dev-local Container Based Environment
• demos with up/setup/down scripts for easy execution
• druid-late
• key-mismatch
• rdbms-cdc-nosql
• mongo-cdc
• … and more to come …
Today's Demo
Kafka Local / DEMO
cd dev-local-demos/druid-late
.
README.md
up.sh
setup.sh
druid.sh
connect.sh
producer/run.sh
Apache Kafka
Apache Druid
Kafka Connect / S3 Sink
Minio
Java Producer - Fake Data
Kafka Local / DEMO
SELECT
(case is_realtime when 1 then 'REALTIME' else 'HISTORICAL' end) "TYPE",
count(*) "COUNT"
FROM sys.segments
GROUP BY 1
Apache Druid
Real-Time
and
Batch
Apache Druid - Real-Time
Deep Storage
__time dimensions metrics __time dimensions metrics __time dimensions metrics
historical
middlemanager
task
broker
query
real-time
batch
real-time (handed-off)
Apache Druid
• reject messages with timestamps earlier than this period before
the task was created
• lateMessageRejectionPeriod
• e.g. PT1H
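The rule can be sketched as a simple time comparison. This is an illustration of the behavior described on the slide, not Druid code; the accepts function and the task creation time are hypothetical, and PT1H is written as a one-hour timedelta.

```python
from datetime import datetime, timedelta, timezone


def accepts(message_time, task_created, rejection_period):
    """Reject messages timestamped earlier than (task creation - period)."""
    return message_time >= task_created - rejection_period


# Hypothetical task creation time; PT1H expressed as a timedelta.
task_created = datetime(2022, 9, 15, 12, 0, tzinfo=timezone.utc)
period = timedelta(hours=1)

on_time = datetime(2022, 9, 15, 11, 30, tzinfo=timezone.utc)
late = datetime(2022, 9, 15, 10, 59, tzinfo=timezone.utc)

print("11:30 accepted:", accepts(on_time, task_created, period))
print("10:59 accepted:", accepts(late, task_created, period))
```

Messages rejected this way never reach a real-time segment, so they only appear in Druid if a batch task later loads them from another copy of the data.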
Apache Druid - Real-Time & Batch
Deep Storage
__time dimensions metrics __time dimensions metrics __time dimensions metrics
broker
lateMessageRejectionPeriod
PT1H
Append
or
Reload
historical
middlemanager
task
task
query
real-time
batch
real-time (handed-off)
Apache Druid - Batch Task
...
"prefixes": [
  "s3://sku/topics/skus/y=2022/m=09/"
],
...
"intervals": [
  "2022-09-01T00:00:00/2022-10-01T00:00:00"
]
...
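The prefix and the interval in the batch task fragment have to describe the same month. A minimal sketch of deriving both from a year and month, assuming the y=/m= path layout shown on the slide; month_reload_params is a hypothetical helper, and the bucket path default is copied from the slide.

```python
from datetime import datetime


def month_reload_params(year, month, base="s3://sku/topics/skus"):
    """Build the S3 prefix and the matching Druid interval for one month."""
    start = datetime(year, month, 1)
    # The interval end is exclusive: the first instant of the next month.
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    prefix = f"{base}/y={year:04d}/m={month:02d}/"
    interval = f"{start.isoformat()}/{end.isoformat()}"
    return prefix, interval


prefix, interval = month_reload_params(2022, 9)
print(prefix)
print(interval)
```

Generating the pair together keeps the batch task from reading one month of files while replacing another month's segments.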
Apache Druid
Demonstration
Apache Druid - Real-Time & Batch
Deep Storage
__time dimensions metrics __time dimensions metrics __time dimensions metrics
broker
lateMessageRejectionPeriod
PT1H
Append
or
Reload
historical
middlemanager
task
task
query
real-time
batch
real-time (handed-off)
https://github.com/kineticedge/dev-local-demos
Demonstration
@nbuesing nbuesing
Questions
@nbuesing nbuesing
https://github.com/kineticedge
dev-local - container ecosystem
dev-local-demos - demonstrations
druid-m1 - build arm64/v8 image for your Apple Silicon
… & more …
https://www.kineticedge.io
