Don't Forget Your Past, From Batch to Real-Time and Back Again with Apache Druid
Open Source North 2022

Neil Buesing, Rill Data
@nbuesing nbuesing

Goals
1. Technology Overview of Apache Druid and Apache Kafka
2. How to run Apache Druid and Apache Kafka locally
3. Druid Ingestion in batch and real-time
4. Query the data using Druid SQL Console and Apache Superset
5. Configure Apache Druid Real-Time Ingestion to make it safe to reload historical segments
6. Real-Time and Batch Ingestion: working together.

Apache Druid
Overview

Apache Druid
Query: broker, router
Command (1): coordinator, overlord (2)
Data: middlemanager (3), peon(s), historical
Dependencies: metadata store (4), zookeeper

1. Apache Druid still uses the term "master" for these services (hoping for a rename).
2. The overlord runs with the coordinator if druid.coordinator.asOverlord.enabled=true.
3. The indexer is an incubating replacement for the middlemanager; its peons are threads, not processes.
4. PostgreSQL or MySQL.

Storage
• File Format
• Segmentation
• Time
• Dimensions
• Metrics
[Figure: a segment with __time, dimensions, and metrics columns]

Apache Druid
• Time
• Segment Granularity
• Query Granularity
[Figure: two segments (__time, dimensions, metrics) labeled 2021-12-07T22:00:00Z and 2021-12-07T22:15:00Z. An event at 2021-12-07T22:18:34.123Z is stored with __time 2021-12-07T22:18:00.000Z. Row timestamps shown: 22:05:00, 22:09:00, 22:13:00, 22:15:00, 22:18:00.]

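To make the two granularities concrete, here is a minimal Python sketch (not Druid code) that floors the slide's example timestamp to different bucket sizes. The helper `floor_to` is hypothetical, and the hour, fifteen-minute, and minute bucket sizes are assumptions picked to match the timestamps shown on the slide.

```python
from datetime import datetime, timedelta, timezone

def floor_to(ts: datetime, step: timedelta) -> datetime:
    """Floor a UTC timestamp to a multiple of `step` since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return ts - ((ts - epoch) % step)

# The event timestamp from the slide.
event = datetime(2021, 12, 7, 22, 18, 34, 123000, tzinfo=timezone.utc)

# A coarse bucket decides which segment (time chunk) the row lands in.
hour_bucket = floor_to(event, timedelta(hours=1))
quarter_bucket = floor_to(event, timedelta(minutes=15))

# A fine bucket decides the __time value stored with the row.
minute_bucket = floor_to(event, timedelta(minutes=1))

print(hour_bucket.isoformat())
print(quarter_bucket.isoformat())
print(minute_bucket.isoformat())
```

Segment granularity controls how rows are split into segments; query granularity controls how much timestamp precision survives inside each segment.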
Apache Druid
• Why Query Granularity?
• Partially Precomputed Aggregates

2021-12-07T22:18:34.123Z is stored as 2021-12-07T22:18:00.000Z

__time    Store    SKU    COUNT  QTY
22:18:34  Cloud 9  1234A  1      4
22:18:49  Cloud 9  1234A  1      3

Rolled up:
__time    Store    SKU    COUNT  QTY
22:18:00  Cloud 9  1234A  2      7

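The rollup on this slide can be sketched in a few lines of Python. This is an illustration of the idea only, not how Druid implements it: `rollup` is a hypothetical helper, minute query granularity is assumed, and the pairing of the quantities 4 and 3 with the two timestamps is a guess from the slide.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Raw events: (timestamp, store, sku, quantity), mirroring the slide's example.
events = [
    (datetime(2021, 12, 7, 22, 18, 34, tzinfo=timezone.utc), "Cloud 9", "1234A", 4),
    (datetime(2021, 12, 7, 22, 18, 49, tzinfo=timezone.utc), "Cloud 9", "1234A", 3),
]

def rollup(rows):
    """Group rows by (minute, store, sku) and pre-aggregate COUNT and QTY."""
    totals = defaultdict(lambda: {"count": 0, "qty": 0})
    for ts, store, sku, qty in rows:
        minute = ts.replace(second=0, microsecond=0)  # assumed minute granularity
        key = (minute, store, sku)
        totals[key]["count"] += 1
        totals[key]["qty"] += qty
    return dict(totals)

rolled = rollup(events)
for (minute, store, sku), agg in rolled.items():
    print(minute.time(), store, sku, agg["count"], agg["qty"])
```

Because each stored row may already represent several events, a query over rolled-up data sums the stored count column rather than counting rows.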
Apache Druid
[Figure: Druid cluster layout. Tasks on the middlemanagers and the historicals each hold segments (__time, dimensions, metrics); segments are also kept in Deep Storage. Queries go through the router (UI) to the broker. The coordinator, overlord, metadata store, and zookeeper sit alongside.]

Apache Druid
select
  DATE_TRUNC('DAY', __time) "TIME",
  storeId,
  sku,
  sum("count") "CNT",
  sum(quantity) "QTY"
from skus
group by 1, 2, 3
order by 1 desc, 4 desc

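Besides the SQL console, the same statement can be posted to Druid's SQL HTTP endpoint (`/druid/v2/sql`). The sketch below builds such a request with the Python standard library. The host and port (a local router on 8888) are assumptions about the local environment, and `build_request` and `run_query` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: a local Druid router listening on port 8888.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

QUERY = """
select
  DATE_TRUNC('DAY', __time) "TIME",
  storeId,
  sku,
  sum("count") "CNT",
  sum(quantity) "QTY"
from skus
group by 1, 2, 3
order by 1 desc, 4 desc
"""

def build_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the SQL as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_query(sql: str) -> list:
    """Send the query and decode the JSON response (needs a running Druid)."""
    with urllib.request.urlopen(build_request(sql)) as response:
        return json.load(response)

request = build_request(QUERY)
print(request.get_method(), request.full_url)
```

Calling `run_query(QUERY)` only works once the cluster is up and the skus datasource has been ingested.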
Apache Kafka
Overview

Apache Kafka
[Figure: the Apache Kafka ecosystem: Producer Application, Consumer Application, Kafka Connect (source), Kafka Connect (sink), Schema Registry, Kafka Streams Application, ksqlDB, and Apache Druid]

Apache Druid & Kafka
Overview

Apache Kafka & Druid
[Figure: an Apache Kafka cluster of three brokers, each holding a copy of partitions a:0, a:1, and a:2. The Druid supervisor uses assign() to hand the partitions to task-0, task-1, and task-2 on the Apache Druid Middle Manager; each task builds its own segment (__time, dimensions, metrics).]

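As a rough model of the diagram, the sketch below spreads partitions over tasks by taking the partition number modulo the task count. That rule is an assumption used for illustration, not a statement of the supervisor's exact algorithm, and `assign_partitions` is a hypothetical helper.

```python
from collections import defaultdict

def assign_partitions(partitions, task_count):
    """Map each partition to a task using partition % task_count (assumed rule)."""
    groups = defaultdict(list)
    for partition in partitions:
        groups[partition % task_count].append(partition)
    return {f"task-{group}": parts for group, parts in sorted(groups.items())}

# Three partitions and three tasks, as on the slide: one partition per task.
three_tasks = assign_partitions([0, 1, 2], task_count=3)

# With fewer tasks than partitions, some task reads more than one partition.
two_tasks = assign_partitions([0, 1, 2], task_count=2)

print(three_tasks)
print(two_tasks)
```

Because the supervisor assigns partitions explicitly instead of joining a consumer group, it controls which task reads which partition.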
Apache Kafka & Druid
[Figure: the Druid supervisor runs task-0 and task-1 on the Druid Middle Manager. Events with timestamps 23:10, 23:11, 22:59, 22:01, 23:55, and 23:59 arrive. Each task holds a 23:00:00Z segment, and Deep Storage holds two 23:00:00Z segments and two 22:00:00Z segments.]

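The effect of out-of-order events on segment count can be sketched as follows. Hourly segment granularity, the calendar date, and the split of events between the two tasks are all assumptions for illustration; `hour_chunk` and `open_segments` are hypothetical helpers.

```python
from datetime import datetime, timezone

def at(hour, minute):
    """Event time on an arbitrary (assumed) day, in UTC."""
    return datetime(2021, 12, 7, hour, minute, tzinfo=timezone.utc)

def hour_chunk(ts):
    """Start of the hourly time chunk that contains `ts`."""
    return ts.replace(minute=0, second=0, microsecond=0)

# (task, event time): the event times come from the slide,
# but which task receives which event is hypothetical.
arrivals = [
    ("task-0", at(23, 10)), ("task-0", at(22, 59)), ("task-0", at(23, 55)),
    ("task-1", at(23, 11)), ("task-1", at(22, 1)), ("task-1", at(23, 59)),
]

def open_segments(rows):
    """Return the distinct (task, time chunk) pairs that need a segment."""
    return {(task, hour_chunk(ts)) for task, ts in rows}

segments = open_segments(arrivals)
for task, chunk in sorted(segments):
    print(task, chunk.isoformat())
```

Every task that sees an event for an older hour has to produce a segment for that hour as well, so the segment count grows with both the number of tasks and the spread of event times.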
Apache Kafka & Druid
[Figure: the Druid supervisor and a single task (task-1) on the Druid Middle Manager holding many segments (__time, dimensions, metrics) at once; labeled "avoid"]

Apache Kafka & Druid
• Fragmented Segments
  • storage costs
  • query performance
  • compaction cost
• Open File Handles
  • middle manager resources
[Figure: a single task (task-1) holding many segments (__time, dimensions, metrics) at once]

Apache Superset
Overview

Development
A Local Environment

Kafka Local
• https://github.com/nbuesing/kafka-local
• kafka
• druid
• superset
• kafka-connect
• ksqlDB
• mongo
• grafana/prometheus dashboards
• mysql
• oracle*
• and more
* You must build the container yourself (instructions provided); not M1 compatible.

Kafka Local
• Container Based Environment
• Each component has its own docker-compose ("mix-n-match")
• shared network
• demos with up/setup/down scripts for easy up and run
  • druid_late (Today's Demo)
  • druid_rollup
  • opensky

Kafka Local / DEMO
cd kafka-local/demo/druid_late
.
  up.sh
  setup.sh
  druid.sh
  connect.sh
  producer/run.sh
  superset.sh

Components:
• Apache Kafka
• Apache Druid
• Apache Superset
• Kafka Connect / S3 Sink
• Minio
• Java Producer - Fake Data

Kafka Local / DEMO
SELECT
  (case is_realtime when 1 then 'REALTIME' else 'HISTORICAL' end) "TYPE",
  count(*) "COUNT"
FROM sys.segments
GROUP BY 1

Apache Druid
Real-Time and Batch

Apache Druid
[Figure: the broker sends a query to the middlemanager tasks, which hold real-time segments, and to the historical, which holds batch segments and real-time (handed-off) segments. Segments (__time, dimensions, metrics) are kept in Deep Storage.]

Apache Druid
• reject messages with timestamps earlier than this period before the task was created
• lateMessageRejectionPeriod
• e.g. PT1H
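
The rule in the bullets above can be written as a one-line comparison. The sketch below is a plain Python illustration, not Druid code; `is_rejected` is a hypothetical helper and the task creation time is made up. In a Kafka supervisor spec, lateMessageRejectionPeriod is set as an ISO 8601 period in the ioConfig section.

```python
from datetime import datetime, timedelta, timezone

def is_rejected(event_time, task_created, rejection_period):
    """True if the event is older than `rejection_period` before task creation."""
    return event_time < task_created - rejection_period

# Hypothetical task creation time; PT1H expressed as a timedelta.
task_created = datetime(2022, 5, 1, 12, 0, tzinfo=timezone.utc)
period = timedelta(hours=1)

on_time = datetime(2022, 5, 1, 11, 30, tzinfo=timezone.utc)
too_late = datetime(2022, 5, 1, 10, 59, tzinfo=timezone.utc)

print(is_rejected(on_time, task_created, period))
print(is_rejected(too_late, task_created, period))
```

With a bound like this, real-time tasks stop writing into time chunks older than the cutoff, which is what makes it safe for batch ingestion to reload those chunks.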

Apache Druid
[Figure: the same real-time and batch diagram, with lateMessageRejectionPeriod set to PT1H on the real-time tasks]
Append or Reload

Kafka Local / DEMO
...
"prefixes": [
  "s3://sku/topics/skus/y=2022/m=01/",
  "s3://sku/topics/skus/y=2022/m=02/",
  "s3://sku/topics/skus/y=2022/m=03/",
  "s3://sku/topics/skus/y=2022/m=04/",
  "s3://sku/topics/skus/y=2022/m=05/"
],
...
"intervals": [
  "2022-01-01T00:00:00/2022-05-01T00:00:00"
]
...
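
The monthly prefix list in the spec fragment above follows a regular pattern, so it can be generated rather than typed. The sketch below is one way to do that in Python; `month_prefixes` is a hypothetical helper, and the surrounding dictionary shows only the S3 input source portion of a batch spec, not a complete ingestion spec.

```python
import json

def month_prefixes(bucket, topic, year, months):
    """Build one S3 prefix per month in the y=YYYY/m=MM layout shown on the slide."""
    return [f"s3://{bucket}/topics/{topic}/y={year}/m={month:02d}/" for month in months]

# Months 01 through 05 of 2022, matching the prefixes on the slide.
prefixes = month_prefixes("sku", "skus", 2022, range(1, 6))

# Only the input source part of the spec; everything else is omitted here.
input_source = {"type": "s3", "prefixes": prefixes}

print(json.dumps(input_source, indent=2))
```

Generating the list keeps the prefixes and the ingestion interval easy to review side by side when a range of months is reloaded.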

https://github.com/nbuesing/kafka-local
Questions
@nbuesing nbuesing
