This is a real-world account from a Druid cluster in production: a story of 48 hours of debugging, learning, and understanding Druid better, filing a couple of issues on the Druid GitHub, and finally a stable production pipeline again, thanks to the Druid community.
We will discuss the bottlenecks we hit in the Overlord, Peon slot issues on the MiddleManagers, Coordinator bottlenecks, how we mitigated task and segment flooding, and the configs we changed, sprinkled with real-world numbers and snapshots from our Grafana dashboards.
4. Contents
Druid 101
How we use Druid
Re-architecture: What & Why
Impact On Druid components
How we fixed the issues
State of Bugs we filed / fixed
5. Druid 101
• Open-source, Apache 2.0 License and under Apache Foundation
• Columnar data store designed for high performance.
• Supports Real-time and Batch ingestion.
• Segment Oriented Storage
• Distributed and modular architecture, horizontally scalable for most parts
• Supports Data tiering – Keep cold data in cheaper storage!
6. What we love about Druid!
Modularity – Separation of concerns
Modularity – Simplicity*: easy to deploy, upgrade, migrate, and manage
Modularity – Flexibility: scale only what you need, retain data based on retention rules on tiers
Modularity – Built for the cloud
Durability – Object store (S3 or Nutanix Objects, for instance) for deep storage
Durability – SQL database for metadata
Admin dashboard – easier debugging and monitoring
8. Ingestion & Query Patterns
● IPFIX log files are collected from the clouds.
○ IPFIX: IP Flow Information Export
○ Summarizes network packet flows to track IP activity
● We enrich the data and store it in an S3 bucket.
● S3 data is ingested into Druid (a hedged ingestion-spec sketch follows this list).
● Serves analytics dashboards in a slice-and-dice manner.
● Used by our ML engine as well.
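For illustration, a minimal sketch of what submitting one of these S3 batch ingestion runs could look like; the datasource name, S3 prefix, columns, and granularities are made-up placeholders, not our real spec:

# Minimal sketch of a native-batch (index_parallel) ingestion of S3 files into Druid.
# All names/values below (datasource, bucket prefix, columns, granularities) are
# illustrative placeholders, not the production spec.
import requests

OVERLORD_URL = "http://overlord:8090"  # placeholder

index_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "ipfix_customer1",  # hypothetical datasource name
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["srcIp", "dstIp", "port"]},  # illustrative
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "s3", "prefixes": ["s3://my-bucket/enriched/2024-01-01/"]},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel", "maxNumConcurrentSubTasks": 2},
    },
}

resp = requests.post(f"{OVERLORD_URL}/druid/indexer/v1/task", json=index_spec, timeout=30)
resp.raise_for_status()
print(resp.json()["task"])  # the submitted task id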
9. Druid Numbers: 3+ Years in Prod
[Dashboard snapshots: cluster size and last-24-hour stats]
11. Data Model for our Apps
● Analytics Apps as part of Nutanix Dashboard
● Customers can slice and dice data given some filters
● Multi-tenant Use Case
● Druid Data source per customer per use case
● Enable features for some data sources
○ Phased rollout for new Druid features
○ Druid Version Upgrades
○ App redesign requiring a change in Druid ingestion or query.
● Workflow engine (Temporal) for pipeline.
● Java-based workers backed by Postgres storage for state.
12. Change in Requirements
● Change in Requirement: Batch (3 hours) to 5 minutes
● Earlier:
○ Agent collects data, dumps to S3.
○ Cron runs every 3 hours, ingests from S3 to Druid.
○ SLA: 3 hours
● New design:
○ SLA: 15 minutes
○ Agent collects data, dumps to S3 every 5 minutes.
○ Ingestion pipeline submits to Druid at a pace Druid can handle.
○ Ingestion pipeline absorbs the backpressure (see the throttling sketch after this list).
● Release plan:
○ Datasources onboarded to the cluster in a phased manner
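A minimal sketch of the "absorb backpressure" idea, assuming a hypothetical MAX_PENDING threshold and the submission sketch above; our real pipeline does this inside a Temporal workflow with Java workers:

# Sketch: hold new submissions while Druid's task queue is backed up, instead of
# flooding the Overlord. Threshold and sleep interval are illustrative placeholders.
import time
import requests

OVERLORD_URL = "http://overlord:8090"  # placeholder
MAX_PENDING = 100                      # hypothetical threshold

def pending_task_count() -> int:
    resp = requests.get(f"{OVERLORD_URL}/druid/indexer/v1/pendingTasks", timeout=30)
    resp.raise_for_status()
    return len(resp.json())

def submit_when_druid_is_ready(index_spec: dict) -> str:
    # Absorb backpressure: wait until the pending queue drains below the threshold.
    while pending_task_count() > MAX_PENDING:
        time.sleep(60)
    resp = requests.post(f"{OVERLORD_URL}/druid/indexer/v1/task", json=index_spec, timeout=30)
    resp.raise_for_status()
    return resp.json()["task"]  # task id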
26. Summary: When Druid was struggling (Overlord on fire)
● Ingested smaller but more numerous tasks.
● Onboarded a few large datasources; fine for a day.
● Gained more confidence.
● Onboarded all datasources at once.
○ Task queue kept increasing (up to 25K). Overlord overwhelmed after 5K.
○ Soon, the Overlord machine's CPU usage was at 100%.
● All the tasks were stuck in the pending state.
● Task count was 12x the previous load, but each task was smaller.
● MiddleManagers were sitting idle, with no incoming tasks.
● Task states were not updating properly, as the Overlord was overwhelmed.
31–34. Handling the Overlord…
● Vertically scaled the Overlord. Didn’t help! No support for horizontal scaling.
● Changed configs:
○ druid.indexer.runner.type: httpRemote (no ZooKeeper for task assignment)
○ druid.indexer.queue.maxSize: 5000 (throttle, don’t give up)
● Set max pending tasks per datasource for an interval to 1, checked via (sketch below):
GET /druid/indexer/v1/pendingTasks?datasource=ds1
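A sketch of how that per-datasource check could be wired up, using the pendingTasks endpoint above; the datasource name "ds1" is illustrative:

# Sketch: only submit a new task for a datasource if it has nothing pending already
# (the "max 1 pending task per datasource per interval" rule).
import requests

OVERLORD_URL = "http://overlord:8090"  # placeholder

def has_pending_task(datasource: str) -> bool:
    resp = requests.get(
        f"{OVERLORD_URL}/druid/indexer/v1/pendingTasks",
        params={"datasource": datasource},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json()) > 0

if not has_pending_task("ds1"):
    pass  # safe to submit the next index task for ds1 (see the earlier submission sketch)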
37. Making DB functional…
● Queries from the Overlord to Postgres for task metadata were taking a long time (see the inspection sketch below).
● Added more CPU to the DB server.
● Improvements:
○ Overlord CPU utilization is lower
○ Number of pending tasks is lower
○ Task slot utilization graph looks stable
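A hedged sketch of how one might spot the slow task-metadata queries on the Postgres side; it assumes the pg_stat_statements extension is enabled, Postgres 13+ column names, and the default druid_tasks table name:

# Sketch: list the slowest statements touching the task-metadata table.
# Requires the pg_stat_statements extension; column names are for Postgres 13+.
import psycopg2

conn = psycopg2.connect("host=metadata-db dbname=druid user=druid")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT query, calls, mean_exec_time
        FROM pg_stat_statements
        WHERE query ILIKE '%druid_tasks%'
        ORDER BY mean_exec_time DESC
        LIMIT 10
        """
    )
    for query, calls, mean_ms in cur.fetchall():
        print(f"{mean_ms:9.1f} ms  x{calls:<8} {query[:80]}")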
44. Summary: Scaling the MiddleManagers
● Increased the number of MiddleManagers so that more task slots were available for the Overlord to assign tasks.
● Then increased the number of slots per MiddleManager, as the new tasks were small, i.e. had fewer files to ingest.
● Created a separate tier for compaction, as those tasks took more resources than the current index tasks.
● Then right-sized the MiddleManager count in each tier by reducing it.
12 MMs × 5 slots → 24 MMs × 5 slots
24 MMs × 5 slots → 12 MMs × 10 slots
12 MMs × 10 slots → 10 MMs × 10 slots + 2 MMs × 5 slots (tiering)
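A small sketch of how slot counts per tier could be sanity-checked against the Overlord's workers API; the URL is a placeholder, and the field names are as exposed by recent Druid versions:

# Sketch: tally task-slot capacity and usage per MiddleManager category (tier).
from collections import defaultdict
import requests

OVERLORD_URL = "http://overlord:8090"  # placeholder

def slot_usage_by_category() -> dict:
    resp = requests.get(f"{OVERLORD_URL}/druid/indexer/v1/workers", timeout=30)
    resp.raise_for_status()
    usage = defaultdict(lambda: {"capacity": 0, "used": 0})
    for w in resp.json():
        category = w["worker"].get("category", "_default_worker_category")
        usage[category]["capacity"] += w["worker"]["capacity"]
        usage[category]["used"] += w["currCapacityUsed"]
    return dict(usage)

# e.g. {'_default_worker_category': {'capacity': 100, 'used': 73},
#       'compaction_tier': {'capacity': 10, 'used': 4}}   (illustrative numbers)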
49. Summary of Coordinator crisis…
● Happy Overlord.
● But issues in the Coordinator now:
○ Huge number of small segments (see the sys.segments sketch below).
○ Unavailable segment count increasing.
○ Coordinator CPU usage increasing.
○ Coordinator cycle taking too long to complete.
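To quantify the small-segment problem, Druid SQL's sys.segments table on the Broker can help; a sketch, with the Broker URL and the 100 MB "small" threshold as illustrative choices:

# Sketch: per-datasource counts of small and unavailable segments from sys.segments.
import requests

BROKER_URL = "http://broker:8082"  # placeholder

SQL = """
SELECT
  datasource,
  COUNT(*)                                             AS total_segments,
  SUM(CASE WHEN "size" < 100000000 THEN 1 ELSE 0 END)  AS small_segments,
  SUM(CASE WHEN is_available = 0 THEN 1 ELSE 0 END)    AS unavailable_segments
FROM sys.segments
GROUP BY datasource
ORDER BY small_segments DESC
LIMIT 20
"""

resp = requests.post(f"{BROKER_URL}/druid/v2/sql", json={"query": SQL}, timeout=60)
resp.raise_for_status()
for row in resp.json():
    print(row)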
54–56. Handling the Coordinator…
● Increased the Coordinator instance type, as it is not horizontally scalable.
● Tried the following coordinator dynamic configs (see the API sketch below):
○ maxSegmentsToMove: 1000
○ percentOfSegmentsToConsiderPerMove: 25 (reduce the number of segments considered per coordinator cycle)
○ useRoundRobinSegmentAssignment: true (assign segments in round-robin fashion first; lazily reassign with the chosen balancer strategy later)
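A sketch of applying those dynamic configs through the Coordinator API rather than the web console; the URL is a placeholder, and we read-modify-write the whole config to avoid dropping other fields:

# Sketch: read the current coordinator dynamic config, overlay our changes, write it back.
import requests

COORDINATOR_URL = "http://coordinator:8081"  # placeholder

current = requests.get(f"{COORDINATOR_URL}/druid/coordinator/v1/config", timeout=30)
current.raise_for_status()
cfg = current.json()
cfg.update({
    "maxSegmentsToMove": 1000,
    "percentOfSegmentsToConsiderPerMove": 25,
    "useRoundRobinSegmentAssignment": True,
})
resp = requests.post(f"{COORDINATOR_URL}/druid/coordinator/v1/config", json=cfg, timeout=30)
resp.raise_for_status()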
57. Handling the Coordinator…
● We saw this error in the coordinator logs during auto compaction for many datasources:
“is larger than inputSegmentSize[2147483648]”
● Our auto compaction config had inputSegmentSizeBytes: 100TB; removing this setting resolved the issue (sketch below).
● This is no longer an issue from Druid 25 onwards.
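A sketch of resubmitting an auto compaction config without inputSegmentSizeBytes via the Coordinator API; the datasource name and skipOffsetFromLatest value are illustrative:

# Sketch: push an auto compaction config that omits inputSegmentSizeBytes entirely.
import requests

COORDINATOR_URL = "http://coordinator:8081"  # placeholder

compaction_config = {
    "dataSource": "ds1",             # illustrative datasource
    "skipOffsetFromLatest": "PT1H",  # leave the most recent hour alone
    # intentionally no inputSegmentSizeBytes: that setting triggered the error above
}

resp = requests.post(
    f"{COORDINATOR_URL}/druid/coordinator/v1/config/compaction",
    json=compaction_config,
    timeout=30,
)
resp.raise_for_status()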
58. Handling the Historicals
● Until auto compaction is done:
○ More segments to scan per query
○ More processing power needed on the historicals
● Cold data has HIGHER segment granularity (larger, compacted segments): compaction done!
● Hot data has LOWER segment granularity (smaller segments): compaction NOT done yet!
[Diagram: queries for recent data hit the current historicals, which hold the smaller, not-yet-compacted segments of each datasource; older historicals hold the larger, compacted segments.]
60. Summary
● Once we stabilized both the Druid ingestion and query pipelines, we onboarded all customers in a phased manner.
● Set the optimal queue size.
● To absorb the initial burst of tasks, we increased the MM count.
● Right-sized the Overlord and Coordinator once the onboarding was complete.
● Get to know the Overlord and Coordinator settings well.