Druid is an open-source data store designed for OLAP queries on event or time-series data. It ingests streaming or batch time-series data, aggregates the data during ingestion, and allows querying of aggregated metrics over time intervals via an HTTP API. It consists of several node types that work together, including brokers for querying, historical nodes for data storage and serving, real-time nodes for ingesting streaming data, and coordinators for managing data distribution. Druid uses other technologies like ZooKeeper, MySQL/Postgres, and deep storage systems to manage configurations, metadata, and actual data storage.
2. Is it about a woods creature?
Nope. It’s about…
a data store designed for OLAP queries on event
(time-series) data.
3. Boring facts
• Open-source, community-driven project
• ~3400 stars, ~7300 commits on GitHub at the moment
• Written in Java
• Very modular / extensible thanks to Guice
4. What does it do?
• Ingests streaming or batch time-series data and splits it into segments, each covering a configured time interval
• During ingestion, it performs aggregation using the configured aggregators to create metrics
• Lets you run several types of queries over the served segments via a nice HTTP API, e.g. simple aggregations over metrics (example below)
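For a rough illustration, a minimal timeseries query posted to the Broker's query endpoint might look like the Python sketch below. The data source, metric names and host/port are assumptions made up for this example; the endpoint path and query shape follow the Druid query documentation.

import requests

# Minimal timeseries query: hourly sums of a "count" metric.
# "events", "total_events" and the Broker address are illustrative;
# adjust them to your own ingestion spec and cluster.
query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2016-06-01/2016-06-02"],
    "aggregations": [
        {"type": "longSum", "name": "total_events", "fieldName": "count"}
    ],
}

resp = requests.post("http://localhost:8082/druid/v2/?pretty", json=query)
resp.raise_for_status()
print(resp.json())  # one entry per hour with the aggregated metric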
6. How does it work?
Druid consists of several types of services (hereinafter, nodes) which together make up a complete working system:
• Broker
• Historical
• Real-time
• Coordinator
• Overlord, Middle Manager, Peons, etc. (aka boring
nodes)
7. How does it work?
It also has three external dependencies (configuration sketch below):
• Apache ZooKeeper for configuration management, leader election and data flow organisation (e.g. Coordinator-to-Historical communication)
• Metadata Storage, such as MySQL or PostgreSQL, used to store (guess what?) various metadata about the system
• Deep Storage, where all the compressed segment data is stored (S3, HDFS, local FS)
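A rough sketch of how these three dependencies show up in Druid's common runtime properties, rendered here as a Python dict for readability. The property names follow the Druid configuration docs; all values are made up for illustration and depend on your deployment.

common_runtime_properties = {
    # Apache ZooKeeper: coordination, leader election, segment announcements
    "druid.zk.service.host": "localhost:2181",

    # Metadata Storage: rules, segment metadata, task state, configs
    "druid.metadata.storage.type": "postgresql",
    "druid.metadata.storage.connector.connectURI": "jdbc:postgresql://localhost:5432/druid",

    # Deep Storage: where the compressed segments actually live
    "druid.storage.type": "local",  # or "hdfs", "s3"
    "druid.storage.storageDirectory": "/tmp/druid/deepStorage",
}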
8. Nodes: Real-time
The Real-time node is responsible for ingesting streaming data.
It also exposes an HTTP API for querying, so data is available right after processing / aggregation (see the sketch below).
That is how Druid allows querying of real-time* data.
*: by real-time here we mean something that happened ~0-15 seconds ago.
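Since recent intervals are served by the Real-time node (and the Broker routes such queries there), a query covering the last few minutes returns freshly ingested rows. A hedged sketch, reusing the illustrative "events" datasource from above and a Broker on its default port:

import requests
from datetime import datetime, timedelta, timezone

# Query the last 10 minutes at minute granularity to see just-ingested data.
fmt = "%Y-%m-%dT%H:%M:%SZ"
now = datetime.now(timezone.utc)
interval = (now - timedelta(minutes=10)).strftime(fmt) + "/" + now.strftime(fmt)

query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "minute",
    "intervals": [interval],
    "aggregations": [{"type": "longSum", "name": "events", "fieldName": "count"}],
}

print(requests.post("http://localhost:8082/druid/v2/", json=query).json())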
9. Nodes: Historical
This guy is responsible for loading data (hereinafter, segments) from Deep Storage and serving it via an HTTP API (same as the Real-time dude).
The Historical Node uses ZooKeeper to learn which segments it should load.
It uncompresses segment data and caches it locally on the FS (see the sketch below).
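Where that local cache lives and how much disk it may use are configurable. A small sketch of the relevant Historical properties, again shown as a Python dict for readability; the property names follow the Druid docs, the paths and sizes are made up.

historical_properties = {
    # Local directories where uncompressed segments are cached, with a size cap (bytes)
    "druid.segmentCache.locations": '[{"path": "/tmp/druid/indexCache", "maxSize": 10000000000}]',
    # Total segment size this node advertises to the Coordinator
    "druid.server.maxSize": "10000000000",
}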
10. Nodes: Coordinator
The node which manages segments and coordinates their distribution across Historical Nodes. It uses ZooKeeper for:
• getting the current cluster state
• assigning segments to Historical Nodes
Loading and dropping of segments is managed via Rules stored in the Metadata Storage (example below).
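Rules are plain JSON and can be managed through the Coordinator's HTTP API. A minimal sketch, assuming a Coordinator on its default port 8081 and an illustrative datasource named "events"; rule field names follow the current Druid docs and may differ slightly in older versions.

import requests

# Keep the last 30 days loaded with two replicas, drop everything older.
rules = [
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]

resp = requests.post("http://localhost:8081/druid/coordinator/v1/rules/events", json=rules)
resp.raise_for_status()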
11. Nodes: Broker
This is the only Node which is usually touched by client applications, because it exposes the HTTP API for querying segment data.
It splits an incoming query into smaller ones based on segment metadata stored in ZooKeeper and queries the corresponding Historical and Real-time Nodes.
12. Nodes: Overlord
The indexing service powers Druid's batch data ingestion. It consists of three node types: Overlord, Middle Manager and Peon.
The Overlord accepts indexing tasks and distributes them to Middle Manager Nodes. It communicates with the latter through ZooKeeper.
It provides a simple HTTP API to create, shut down and view the status of indexing tasks (example below).
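A hedged sketch of that API, assuming an Overlord on its default port 8090 and an already prepared task spec file named index_task.json (both illustrative):

import json
import requests

OVERLORD = "http://localhost:8090"

# Submit a batch index task; the spec file describes the datasource,
# parse spec, intervals and aggregators.
with open("index_task.json") as f:
    task_spec = json.load(f)

resp = requests.post(OVERLORD + "/druid/indexer/v1/task", json=task_spec)
task_id = resp.json()["task"]

# Check the task status (poll until it reaches SUCCESS or FAILED).
status = requests.get(OVERLORD + "/druid/indexer/v1/task/" + task_id + "/status").json()
print(status["status"]["status"])  # e.g. RUNNING, SUCCESS, FAILED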
13. Nodes: Middle Manager
The Middle Manager executes submitted indexing tasks. It spawns separate JVMs, i.e. Peon Node processes, for this. Each Peon runs one task at a time.
The MM retrieves indexing tasks from ZooKeeper, stores each task as a JSON file and runs a Peon, providing it with the path to the task file.
After processing, the Peon stores the segment data in Deep Storage.
14. Live example
• Run all nodes locally
• Local FS is used for Deep Storage
• Derby is used for Metadata Storage
• ZooKeeper … well, it’s ZooKeeper, gotta run it
• Kafka 0.8.7.6.5.4.3.2.1 for streaming data <3
• Various Bash and Node scripts for loading data (sketch below)
• Imply Pivot for visualisation of queried results
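The loading scripts essentially push JSON events into the Kafka topic that the Real-time node consumes. A rough Python equivalent, assuming a local Kafka broker and an illustrative topic named "events"; the event fields are made up.

import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one illustrative event per second.
while True:
    event = {"timestamp": int(time.time() * 1000), "page": "/home", "count": 1}
    producer.send("events", event)
    time.sleep(1)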
15. Problems?
Everybody can tell you the good sides of Druid. Time for the issues!
• DevOps effort is needed to spin up all the Nodes and dependencies (did we clone Dmytro already?)
• Limited number of aggregations
• Druid loves cookies… no, space, it needs space, more space, and it's still not enough anyway.