Don't Forget Your Past, From Batch to Real-Time and Back Again with Apache Druid
Open Source North 2022


Neil Buesing, Rill Data
@nbuesing nbuesing
Goals
1. Technology Overview of Apache Druid and Apache Kafka
2. How to run Apache Druid and Apache Kafka locally
3. Druid Ingestion in batch and real-time
4. Query the data using Druid SQL Console and Apache Superset
5. Configure Apache Druid Real-Time Ingestion to make it safe to reload historical segments
6. Real-Time and Batch Ingestion: working together.
Apache Druid
Overview
Apache Druid
Query
• broker
• router
Command (1)
• coordinator
• overlord (2)
Data
• middlemanager (3)
• historical
• peon(s)
Dependencies
• metadata store (4)
• zookeeper

Notes:
1. Apache Druid still uses the term "master" for this group (hoping for a rename).
2. The overlord runs within the coordinator if druid.coordinator.asOverlord.enabled=true.
3. The indexer is an incubating replacement for the middlemanager; its peons are threads, not processes.
4. PostgreSQL or MySQL.
Storage
• File Format
• Segmentation
• Time
• Dimensions
• Metrics
[Figure: segment columns: __time | dimensions | metrics]
Apache Druid
• Time
• Segment Granularity
• Query Granularity
[Figure: segments with columns __time | dimensions | metrics. Timestamps shown: 2021-12-07T22:00:00Z and 2021-12-07T22:15:00Z; 2021-12-07T22:18:34.123Z truncated to 2021-12-07T22:18:00.000Z; row times 22:15:00, 22:18:00, 22:15:00, 22:18:00, 22:09:00, 22:09:00, 22:05:00, 22:13:00.]
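The two granularities can be sketched in a few lines of Python. This is only an illustration of the truncation idea, not Druid code: it assumes an hourly segment granularity and a one-minute query granularity (the slide's timestamps suggest these, but the actual settings are not stated), and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def floor_to_hour(ts):
    """Start of the hour: which time chunk (segment) the row falls into."""
    return ts.replace(minute=0, second=0, microsecond=0)


def floor_to_minute(ts):
    """Start of the minute: the __time value kept after truncation."""
    return ts.replace(second=0, microsecond=0)


# Event timestamp taken from the slide.
event = datetime(2021, 12, 7, 22, 18, 34, 123000, tzinfo=timezone.utc)

segment_start = floor_to_hour(event)   # assumed segment granularity: hour
stored_time = floor_to_minute(event)   # assumed query granularity: minute

print(segment_start.isoformat())  # 2021-12-07T22:00:00+00:00
print(stored_time.isoformat())    # 2021-12-07T22:18:00+00:00
```

Segment granularity decides which segment a row lands in; query granularity decides how much timestamp precision survives inside that segment.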
Apache Druid
• Why Query Granularity?
• Partially Precomputed Aggregates

2021-12-07T22:18:34.123Z → 2021-12-07T22:18:00.000Z

Ingested rows:
__time    Store    SKU    COUNT  QTY
22:18:34  Cloud 9  1234A  1      4
22:18:49  Cloud 9  1234A  1      3

After rollup:
__time    Store    SKU    COUNT  QTY
22:18:00  Cloud 9  1234A  2      7
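The rollup on this slide can be reproduced with a small Python sketch. It is a simplified model rather than Druid's implementation: rows that share the same truncated minute, store, and SKU are merged, with a count and a quantity sum. The field names and the pairing of each quantity with a timestamp are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def rollup(rows):
    """Merge rows that share (minute, store, sku) into one aggregated row."""
    totals = defaultdict(lambda: {"count": 0, "quantity": 0})
    for row in rows:
        minute = row["time"].replace(second=0, microsecond=0)
        key = (minute, row["store"], row["sku"])
        totals[key]["count"] += 1
        totals[key]["quantity"] += row["quantity"]
    return dict(totals)


rows = [
    {"time": datetime(2021, 12, 7, 22, 18, 34), "store": "Cloud 9",
     "sku": "1234A", "quantity": 4},
    {"time": datetime(2021, 12, 7, 22, 18, 49), "store": "Cloud 9",
     "sku": "1234A", "quantity": 3},
]

rolled = rollup(rows)
for (minute, store, sku), agg in rolled.items():
    print(minute.time(), store, sku, agg["count"], agg["quantity"])
# 22:18:00 Cloud 9 1234A 2 7
```

Because the stored row already represents two events, a query that wants the number of original events has to sum the count column rather than count rows.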
Apache Druid
[Diagram: a query goes to the router (ui) and the broker; data nodes are the historicals and the middlemanager running tasks; segments (__time | dimensions | metrics) sit in Deep Storage; supporting services are the coordinator, overlord, metadata store, and zookeeper.]
Apache Druid
select
  DATE_TRUNC('DAY', __time) "TIME",
  storeId,
  sku,
  sum("count") "CNT",
  sum(quantity) "QTY"
from skus
group by 1, 2, 3
order by 1 desc, 4 desc
Apache Kafka
Overview
Apache Kafka
[Diagram: Apache Kafka with Kafka Connect (source), Kafka Connect (sink), Schema Registry, Kafka Streams Application, ksqlDB, Producer Application, Consumer Application, and Apache Druid.]
Apache Druid & Kafka
Overview
Apache Kafka & Druid
[Diagram: an Apache Kafka cluster of three brokers holding the replicas of topic partitions a:0, a:1, and a:2. The druid supervisor runs task-0, task-1, and task-2 on the Apache Druid Middle Manager; the tasks read partitions via assign() and each builds a segment (__time | dimensions | metrics).]
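The diagram shows partitions being handed to tasks explicitly (assign()) rather than through consumer-group rebalancing. A minimal sketch of one way such a mapping could be computed follows; the modulo rule and the function name are assumptions for illustration, not a statement of how the supervisor is implemented.

```python
def partitions_for_task(task_index, task_count, partition_count):
    """Partitions a task would read under a simple modulo mapping (assumed)."""
    return [p for p in range(partition_count) if p % task_count == task_index]


# Three partitions of topic "a" spread over three tasks, as on the slide.
assignment = {
    f"task-{i}": [f"a:{p}" for p in partitions_for_task(i, 3, 3)]
    for i in range(3)
}
print(assignment)
# {'task-0': ['a:0'], 'task-1': ['a:1'], 'task-2': ['a:2']}

# With fewer tasks than partitions, a task reads more than one partition.
print(partitions_for_task(0, 2, 3))  # [0, 2]
```

A fixed mapping like this means each task knows exactly which partitions and offsets it owns.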
Apache Kafka & Druid
[Diagram: the druid supervisor runs task-0 and task-1 on the Druid Middle Manager. Messages with times 23:10, 23:11, 22:59, 22:01, 23:55, and 23:59 arrive; segments (__time | dimensions | metrics) labeled 22:00:00Z and 23:00:00Z appear in the tasks and in Deep Storage.]
Apache Kafka & Druid
[Diagram, labeled "avoid": the druid supervisor with task-1 and another task on the Druid Middle Manager, each holding many small segments (__time | dimensions | metrics).]
Apache Kafka & Druid
• Fragmented Segments
  • storage costs
  • query performance
  • compaction cost
• Open File Handles
  • middle manager resources
[Diagram: task-1 and another task, each holding many small segments (__time | dimensions | metrics).]
Apache Superset
Overview
Apache Superset
Development
A Local Environment
Kafka Local
• https://github.com/nbuesing/kafka-local
• kafka
• druid
• superset
• kafka-connect
• ksqlDB
• mongo
• grafana/prometheus dashboards
• mysql
• oracle*
• and more
* Must build the container yourself (instructions provided); not M1 compatible
Kafka Local
• Container-based environment
• Each component has its own docker-compose file ("mix-n-match")
• Shared network
• Demos with up/setup/down scripts for easy up and run
  • druid_late (today's demo)
  • druid_rollup
  • opensky
Kafka Local / DEMO
cd kafka-local/demo/druid_late
.
  up.sh
  setup.sh
  druid.sh
  connect.sh
  producer/run.sh
  superset.sh

Apache Kafka
Apache Druid
Apache Superset
Kafka Connect / S3 Sink
Minio
Java Producer - Fake Data
Kafka Local / DEMO
SELECT
  (case is_realtime when 1 then 'REALTIME' else 'HISTORICAL' end) "TYPE",
  count(*) "COUNT"
FROM sys.segments
GROUP BY 1
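Outside the SQL console, the same query can be sent to Druid's SQL HTTP endpoint. The sketch below only builds the request, using the Python standard library. The base URL `http://localhost:8888` is an assumption about how the local environment exposes the router, and `build_sql_request` is a hypothetical helper name.

```python
import json
import urllib.request

SEGMENT_TYPE_QUERY = """
SELECT
  (CASE is_realtime WHEN 1 THEN 'REALTIME' ELSE 'HISTORICAL' END) "TYPE",
  COUNT(*) "COUNT"
FROM sys.segments
GROUP BY 1
"""


def build_sql_request(query, base_url="http://localhost:8888"):
    """Build a POST request for the Druid SQL API (base_url is an assumption)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(SEGMENT_TYPE_QUERY)

# To run it against a live cluster, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Running this repeatedly while the producer is sending data is one way to watch segments move from real-time to historical.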
Apache Druid
Real-Time and Batch
Apache Druid
[Diagram: query → broker → historical and middlemanager tasks; Deep Storage holds segments (__time | dimensions | metrics). Labels: real-time, batch, real-time (handed-off).]
Apache Druid
• Reject messages with timestamps earlier than the configured period before the task was created
• lateMessageRejectionPeriod
• e.g. PT1H (one hour)
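The rejection rule can be written out as a small Python sketch. It models only the comparison described on the slide and is not the supervisor's own code. The task creation time and message times are hypothetical, PT1H is expressed as a `timedelta`, and the spec fragment shows only where the setting would sit, with all other fields omitted.

```python
from datetime import datetime, timedelta, timezone

# Fragment of a Kafka supervisor spec showing the setting (other fields omitted).
SUPERVISOR_FRAGMENT = {
    "type": "kafka",
    "spec": {"ioConfig": {"lateMessageRejectionPeriod": "PT1H"}},
}


def is_rejected(message_time, task_created_at, rejection_period):
    """True when the message is older than the cutoff and would be dropped."""
    cutoff = task_created_at - rejection_period
    return message_time < cutoff


# Hypothetical task creation time; PT1H expressed as a timedelta.
task_created_at = datetime(2022, 5, 1, 12, 0, tzinfo=timezone.utc)
one_hour = timedelta(hours=1)

late = datetime(2022, 5, 1, 10, 59, tzinfo=timezone.utc)
recent = datetime(2022, 5, 1, 11, 30, tzinfo=timezone.utc)

print(is_rejected(late, task_created_at, one_hour))    # True
print(is_rejected(recent, task_created_at, one_hour))  # False
```

With a bound like this, real-time tasks stop writing into time ranges older than the cutoff, so those ranges can be rebuilt by batch ingestion without racing against late streaming data.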
Apache Druid
[Diagram: query → broker → historical and middlemanager tasks; Deep Storage holds segments (__time | dimensions | metrics). Labels: real-time, batch, real-time (handed-off).]
lateMessageRejectionPeriod: PT1H
Append or Reload
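In a native batch ingestion spec, the choice between appending and reloading comes down to the `appendToExisting` flag in the `ioConfig`. The sketch below builds that fragment in Python. It is a partial spec for illustration only: `batch_io_config` is a hypothetical helper, and the JSON input format is an assumption about how the S3 sink writes its files.

```python
import json


def batch_io_config(prefixes, append):
    """Partial ioConfig for a parallel batch task (illustrative, not complete)."""
    return {
        "type": "index_parallel",
        "inputSource": {"type": "s3", "prefixes": list(prefixes)},
        "inputFormat": {"type": "json"},  # assumed file format
        "appendToExisting": append,
    }


january = ["s3://sku/topics/skus/y=2022/m=01/"]

# Reload: replace whatever segments already exist for the ingested intervals.
reload_config = batch_io_config(january, append=False)

# Append: add new segments alongside the existing ones.
append_config = batch_io_config(january, append=True)

print(json.dumps(reload_config, indent=2))
```

Reloading is the safer default when the same source files may be ingested more than once, since appending them again would double-count rows.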
Kafka Local / DEMO
...
"prefixes": [
  "s3://sku/topics/skus/y=2022/m=01/",
  "s3://sku/topics/skus/y=2022/m=02/",
  "s3://sku/topics/skus/y=2022/m=03/",
  "s3://sku/topics/skus/y=2022/m=04/",
  "s3://sku/topics/skus/y=2022/m=05/"
],
...
"intervals": [
  "2022-01-01T00:00:00/2022-05-01T00:00:00"
]
...
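Writing the month prefixes by hand gets tedious for longer ranges. The following Python sketch generates the same list; `month_prefixes` is a hypothetical helper, and the `granularitySpec` fragment simply carries the interval from the slide unchanged.

```python
import json


def month_prefixes(base, year, first_month, last_month):
    """One S3 prefix per month, matching the y=YYYY/m=MM layout on the slide."""
    return [
        f"{base}/y={year}/m={month:02d}/"
        for month in range(first_month, last_month + 1)
    ]


prefixes = month_prefixes("s3://sku/topics/skus", 2022, 1, 5)

# Interval copied from the slide, not derived from the prefixes.
granularity_spec = {"intervals": ["2022-01-01T00:00:00/2022-05-01T00:00:00"]}

print(json.dumps({"prefixes": prefixes}, indent=2))
print(json.dumps(granularity_spec, indent=2))
```

The prefixes control which files are read, while the intervals control which time range the task is allowed to write.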
https://github.com/nbuesing/kafka-local
Questions
@nbuesing nbuesing
