1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid – Deep Dive
Kashif Khan – PS South-East
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Is Druid?
Druid is a distributed, real-time, column-oriented datastore
designed to quickly ingest and index large amounts of data
and make it available for real-time query.
Features:
 Streaming Data Ingestion
 Sub-Second Queries
 Merge Historical and Real-Time Data (Lambda Architecture)
 Approximate Computation (Approximate Histograms)
 Multi-tenant (1000s of concurrent Users)
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use Cases:
• Powering user-facing analytical applications
• Unifying historical and real-time events
• BI/OLAP queries (slice-and-dice and drill-down)
• Behavioral analysis (funnel analysis, distinct counts)
• Exploratory analytics / root-cause analysis
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Uses Druid?
Ad-hoc analytics.
High concurrency user-facing real-time slice-and-dice.
Real-time loads of 10s of billions of events per day.
Powers infrastructure anomaly detection dashboards.
Ingest rates > 2TB per hour.
Exploratory analytics on clickstream sessions.
Real-time user behavior analytics.
Primary data store for a visual analytics service in the
RTB (real-time bidding) space, ingesting 200bn events/day.
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Fast Facts
Most Events per Day: 30 billion events/day (Metamarkets)
Most Computed Metrics: 1 billion metrics/min (Jolata)
Largest Cluster: 200 nodes (Metamarkets)
Largest Hourly Ingestion: 2 TB/hour (Netflix)
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
[Architecture diagram: events, logs, transactions, and sensors feed a messaging bus (Kafka, RabbitMQ, JMS); stream processing (NiFi, Storm, Spark Streaming) delivers the streams to a real-time analytics layer built on Druid (OLAP), Hive LLAP/Spark, and HBase/Phoenix.]
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Alternative Solutions
• Relational Databases
 Star schema: slow, becoming outdated
• Key/Value Stores (HBase, Cassandra, OpenTSDB)
 Fast writes and fast lookups
 Schema-less
 Must pre-compute all queries (not always possible)
 Exponential scaling cost (pre-computing all queries as data grows)
• General Compute Engines (Hadoop, Spark; SQL on Hadoop: Hive, Spark SQL, Apache Drill, Presto)
 Hard to meet performance expectations (low-latency queries in a multi-tenant environment)
https://github.com/apache/incubator-druid/tree/master/docs/content/comparisons
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid Architecture
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
A Typical Druid Deployment
[Deployment diagram: Tranquility, the Kafka Indexing Service, Hive, and Hadoop/native index tasks feed the indexing service (Overlord + MiddleManager + Peons); segments are handed off to deep storage (HDFS or S3) and queried through Superset, dashboards, and BI tools. Node counts scale with data size and query volume.]
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Realtime Indexing
[Real-time indexing diagram: data streams from Spark, Flink, Storm, Python, or Java clients are pushed through Tranquility, or pulled from Kafka by the Kafka Indexing Service, into the indexing service (Overlord -> MiddleManager -> Peons). Completed segments are pushed to deep storage and handed off to Historical nodes (each with a local segment cache); the Coordinator, Broker, and ZooKeeper tie the cluster together.]
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Batch Indexing
 Indexing is performed by the Druid Indexing Service
components:
– Overlord (N = 1)
– Middle Managers (N >= 1)
– Peons (N >= 1)
 Indexing is done via MapReduce jobs that build
segment files.
 Batch indexing is done on data that already exists in
deep storage (e.g. HDFS).
 The index definition is specified via a JSON file and
submitted to the Overlord (a sketch follows below).
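A minimal sketch of such a spec for a Hadoop index task; the datasource name, columns, interval, and input path here are illustrative, not from the original deck:
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviews",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "ts", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["page", "user"] }
        }
      },
      "metricsSpec": [{ "name": "count", "type": "count" }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-01-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "/data/pageviews/2016-01-01.json" }
    }
  }
}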
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Ingestion Path
 The indexing spec (JSON file) is submitted to the Overlord through its API.
 The Overlord assigns the indexing tasks to Middle Manager nodes.
 The MM creates peon tasks that perform the actual indexing.
 The data is immediately available for query as it is indexed in the MM; however, the
segment has not yet officially been created.
 In the case of a new data source that doesn't have any existing segments, the datasource may
not yet be visible in the metadata store.
 As soon as the indexing task finalizes the segment, it updates the metadata store,
hands the segment off to deep storage, and the segment is then picked up by a Historical node
(Historicals watch ZooKeeper to learn which segments they are responsible for).
 The indexing task completes based on the configured segment granularity
or task duration (in the case of the Kafka Indexing Service).
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Historical Nodes
 Shared-nothing architecture
 Main workhorses of a Druid cluster
 Load immutable, read-optimized segments
 Respond to queries
 Use memory-mapped files
to load segments
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Broker Nodes
 Keep track of segment announcements in the cluster
– (This information is kept in ZooKeeper, much as Storm and HBase do.)
 Scatter queries across Historical and real-time nodes
– (Clients issue queries to this node, but queries are processed elsewhere.)
 Merge results from the different query nodes
 Provide a (distributed) caching layer
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Coordinator Nodes
 Assign segments to Historical nodes
 Use an interval-based cost function to
distribute segments
 Ensure query load is uniform
across Historical nodes
 Handle replication of data
 Apply configurable rules to load/drop data
 The Overlord and Coordinator should be
co-located on the same node.
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Overlord Node
 Responsible for accepting tasks, coordinating task distribution, creating locks around
tasks, and returning statuses to callers
 Can be configured to run in one of two modes – local or remote (local is the default)
 Local mode is typically used for simple workflows
 In remote mode, the Overlord and Middle Manager run in separate processes, and each can
run on a different server.
 Remote mode is recommended if you intend to use the indexing service as the single
endpoint for all Druid indexing.
 Submitting tasks and querying task status:
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/status
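For example, a task spec can be submitted and its status polled with curl (index-task.json and {taskId} below are placeholders):
curl -X POST -H 'Content-Type: application/json' -d @index-task.json http://<OVERLORD_IP>:<port>/druid/indexer/v1/task
curl http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/status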
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Middle Manager and Peon
 Enable stream ingestion in Druid.
 Events are queryable as soon as they are ingested.
 A Middle Manager is a worker node that executes submitted tasks.
 Middle Managers create tasks called peons that run in separate JVMs and index the
data.
 Each peon is capable of running only one task at a time; however, a Middle Manager
may have multiple peons.
 druid.worker.capacity is the number of concurrent tasks that a MiddleManager
can run (see the fragment below).
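A minimal Middle Manager runtime.properties fragment showing these knobs; the values are illustrative, not recommendations:
druid.worker.capacity=4
druid.indexer.runner.javaOpts=-server -Xmx3g -Duser.timezone=UTC -Dfile.encoding=UTF-8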
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Fault tolerance
 Historical Node: If a Historical node dies, another Historical node can take its place; there is no fear of
data loss.
 Coordinator: Can be run in a hot fail-over configuration. If no Coordinators are running, changes to the
data topology stop happening (no new data and no data-balancing decisions), but the system will
continue to run.
 Broker: Can be run in parallel or in hot fail-over.
 Indexing Service: Workers run replicated ingestion tasks; the coordination piece has hot fail-over.
 Middle Manager: Multiple of these can run in parallel, processing the exact same stream. They
periodically checkpoint to disk and eventually push out to deep storage.
 Deep Storage (file system): If this is not available, new data cannot enter the cluster, but the
cluster will continue operating as-is.
 Metadata Storage: If this is not available, the Coordinator will be unable to find out about new segments in
the system, but it will continue with its current view of the segments that should exist in the cluster.
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Tranquility vs. Kafka Indexing Service
Tranquility:
 Bundles idiomatic Java and Scala APIs that work with
different tools: Storm, Kafka, Samza, Spark, Flink, etc.
 Needs an Overlord and enough Middle Managers to
perform the indexing.
 Events with timestamps outside your configured
windowPeriod will be dropped.
 If you are using Tranquility inside Storm or Samza,
various parts of both architectures have an at-least-once
design and can lead to duplicated events.
 Does not guarantee that your events will be
processed exactly once.
Kafka-Indexing-Service:
 Designed specifically for Kafka as the data source.
 Needs an Overlord and enough Middle Managers to
perform the indexing.
 Can handle late events.
 Guarantees exactly-once ingestion.
 Easier to use compared to Tranquility.
 The indexing (supervisor) spec needs to be submitted separately.
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Real-time Ingestion using Kafka-Indexing-Service
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "pvcc",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "eta", "format": "auto" },
        "dimensionsSpec": { "dimensions": [
          "nextLocRmlCd", "destRmlCd", "nextLocOutboundL4ConsNbr", "outboundL4ConsNbr",
          "lastKnownLocCd", "commitDt", "product2x2HighlevelGroupDesc", "product2x2PriorityNbr",
          "enrouteStatus", "adjustedPackageType", "latestTranInfacilityState", "nextLocOutboundL4ConsTypeCd" ] }
      }
    },
    "metricsSpec": [{ "name": "count", "fieldName": "count", "type": "doubleSum" }],
    "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "metrics",
    "consumerProperties": {
      "bootstrap.servers": "kafkabroker1:6667,kafkabroker2:6667",
      "security.protocol": "SASL_PLAINTEXT",
      "sasl.kerberos.service.name": "kafka"
    },
    "taskCount": 8,
    "replicas": 2,
    "taskDuration": "PT1H"
  }
}
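A spec like the above is submitted to the Overlord's supervisor endpoint, for example (kafka-supervisor.json is a placeholder filename for the spec):
curl -X POST -H 'Content-Type: application/json' -d @kafka-supervisor.json http://<OVERLORD_IP>:<port>/druid/indexer/v1/supervisor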
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Pros and Cons
Pros:
• Supports both real-time and batch data ingestion
• Merges historical and real-time data (Lambda Architecture)
• Sub-second query response with very high concurrency
• Built-in query caching capability
• Highly optimized storage using bitmap indexing, compression, and dictionary encoding
• Emits a rich set of metrics for monitoring overall system performance
• Supports roll-up while ingesting data; very flexible segment and query granularities
Cons:
• Strictly time-series.
• No joins; very limited support through lookups.
• Limited SQL support, through Hive LLAP and Druid's native SQL.
• No authorization in the current release (basic authorization using the druid-basic-security extension is available from HDP 3.0 onwards).
• No support for multiple timestamp dimensions; all dimensions are treated as strings.
• Query syntax is not very friendly and requires a lot of learning.
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Structure
 Druid stores its index in segment files, which are partitioned by time.
 Druid can ingest denormalized data in JSON, CSV, or a delimited form such as
TSV, or any custom format (using regex/JavaScript). Avro is supported
through an extension.
 One segment file is created for each time interval, where the time interval is
configurable via the segmentGranularity parameter.
 For Druid to operate well under heavy query load, it is important for the
segment file size to be within the recommended range of 300 MB-700 MB.
 A segment file is columnar and consists of three basic column types: the
timestamp column, dimension columns, and metric columns.
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid Query
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Query Types
 Timeseries: time-based queries (see the sample below)
 TopN: equivalent to group-by + order over one dimension
– approximate if more than 1000 dimension values
 GroupBy
 Time boundary: returns the earliest and latest data points of a data set
 Search / Select queries
 For each query we can use operators like:
– Granularity (roll-up)
– Filters
– Aggregation / post-aggregation
– Etc.
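As an illustration, a minimal timeseries query against the wikiticker datasource used later in this deck; the interval and metric names are assumptions:
{
  "queryType": "timeseries",
  "dataSource": "wikiticker",
  "granularity": "hour",
  "aggregations": [{ "type": "longSum", "name": "edits", "fieldName": "count" }],
  "intervals": ["2016-06-27/2016-06-28"]
}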
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Group-by example
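A minimal sketch of a Druid groupBy query along these lines; the dimension and metric names are illustrative:
{
  "queryType": "groupBy",
  "dataSource": "wikiticker",
  "granularity": "all",
  "dimensions": ["channel"],
  "aggregations": [{ "type": "longSum", "name": "edits", "fieldName": "count" }],
  "intervals": ["2016-06-27/2016-06-28"]
}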
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Results example
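GroupBy results come back as a JSON array of timestamped rows; a sketch with made-up values:
[ {
  "version": "v1",
  "timestamp": "2016-06-27T00:00:00.000Z",
  "event": { "channel": "#en.wikipedia", "edits": 1234 }
} ]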
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid Hive integration
Query Druid from Hive with SQL:
 Data already exists in Druid.
 Druid has its own JSON-based query language.
 No native BI tool integration.
 Point Hive to the Broker and specify the data source name.
 Use Hive as a virtualization layer.
 Query Druid data with SQL and plug in any BI tool.
Hive query acceleration:
 Data already exists in Hive.
 Data is stored in a distributed filesystem like HDFS or S3,
in a format that Hive can read, e.g. TSV, CSV, ORC, Parquet.
 Perform some pre-processing over various data sources
before feeding them to Druid.
 Accelerate queries over Hive data.
 Join between hot and cold data.
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Querying Druid data sources
 Automatic rewriting when a query is expressed over a Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in the logical plan corresponding to different kinds of Druid queries
(Timeseries, TopN, GroupBy, Select)
 Translate the (sub)plan of operators into a valid Druid JSON query
– The Druid query is encapsulated within a Hive TableScan operator
 The Hive TableScan uses the Druid input format
– Submits the query to Druid and generates records out of the query results
– Can be split across various Historicals in parallel, or interact with the Druid Broker node only
 It might not be possible to push all computation to Druid
– The contract is that the query should always be executed
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Query Druid from Hive with SQL
 Point Hive to the Broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Simple CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE druid_wikiticker
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
(Here druid_wikiticker is the Hive table name, DruidStorageHandler is the Hive storage handler classname, and wikiticker is the Druid data source name.)
 Broker node endpoint is specified as a Hive configuration parameter
 Automatic Druid data schema discovery via a segment metadata query
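Once the table is defined, standard SQL works against it and is rewritten into Druid queries; a simple sketch (the column name page is an assumption):
SELECT page, COUNT(*) AS events
FROM druid_wikiticker
GROUP BY page
ORDER BY events DESC
LIMIT 10;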
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Query Druid from Hive [screenshots: Hive table creation, query with SQL, and the resulting Hive query plan]
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive queries acceleration
 Use a Create Table As Select (CTAS) statement:
CREATE TABLE druid_wikiticker
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
(Here druid_wikiticker is the Hive table name, DruidStorageHandler is the Hive storage handler classname, and wikiticker is the Druid data source name.)
 Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive
column type:
– Timestamp -> __time
– Dimensions -> page, user
– Metrics -> c_added, c_removed
Credit jcamacho@apache.org
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Index Hive Data in Druid [screenshots: create a Druid-backed table, SQL query against Druid, SQL query against Hive]
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Tech Preview: Simple Druid Management with Ambari
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid: Built-in Grafana Dashboard
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Overlord UI
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Best Practices
 Use the UTC timezone.
– This can greatly mitigate potential query problems with inconsistent timezones.
 Use SSDs.
– SSDs are highly recommended for Historical nodes when they have more segments loaded
than available memory.
 Use JBOD.
– Historicals might get improved disk throughput with JBOD.
 Use Timeseries and TopN queries instead of GroupBy where possible.
– Timeseries and TopN queries are much more optimized and significantly faster than groupBy
queries for their designed use cases.
– Issuing multiple topN or timeseries queries from your application can potentially be more
efficient than a single groupBy query.
 The druid user should be part of the hadoop group.
– Make sure the druid user exists on all cluster nodes, regardless of whether a Druid service is
installed on them.
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Best Practices
 Keep Historical, Broker, and MiddleManager nodes on dedicated machines. The nodes that
respond to queries will use as many cores as are available, depending on usage.
 Keep the segment file size in the range of 300 MB-700 MB.
 Increase the ulimit for the druid user.
 Change the default indexer task base directory
(druid.indexer.task.baseTaskDir) from /apps/druid/tasks to a mount that has
enough space.
 Use the Caffeine cache.
– Set druid.cache.type=caffeine (requires JRE 8u60 or later).
– Set druid.cache.sizeInBytes (default is ~10 MB).
 Enable Broker caching (see the fragment below).
– Set druid.broker.cache.useCache=true (default is false).
– Set druid.broker.cache.populateCache=true (default is false).
– Adjust druid.broker.cache.unCacheable to make sure your queries are cached. By
default, select and groupBy queries are not cached.
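Put together, a Broker runtime.properties fragment for the caching settings above might look like this; the cache size is an illustrative value, not a recommendation:
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=268435456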
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Best Practices - JVM Flags
These are general guidelines only; adjust them as necessary for your specific
deployment.
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=<something other than /tmp which might be mounted to
volatile tmpfs file system>
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dorg.jboss.logging.provider=slf4j
-Dnet.spy.log.LoggerImpl=net.spy.memcached.compat.log.SLF4JLogger
-Dlog4j.shutdownCallbackRegistry=io.druid.common.config.Log4jShutdown
-Dlog4j.shutdownHookEnabled=true -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-Xloggc:/var/logs/druid/historical.gc.log -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m
-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/logs/druid/historical.hprof
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Best Practices - JVM Heap Sizes
These are general guidelines for a production cluster. Adjust them as you see fit for your environment.
 Broker nodes:
– Recommended 20-30 GB of heap, used mainly to merge results from Historical and
MiddleManager nodes.
 Historical nodes:
– Recommended 1 GB * (processing.numThreads) of heap for normal usage; by
default, processing.numThreads is (total cores - 1). See the sketch below.
 Coordinator and Overlord nodes:
– Recommended 4 GB and 1 GB respectively. Should be co-located.
 MiddleManager nodes:
– Recommended 500 MB - 1 GB.
 Peon (druid.indexer.runner.javaOpts):
– Recommended 3 GB.
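As a concrete sketch, an 8-core Historical node under these guidelines (7 processing threads, hence roughly 7 GB of heap) might use the following illustrative settings:
-server
-Xms7g
-Xmx7g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8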
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DirectMemory Exception
Not enough direct memory. Please adjust -XX:MaxDirectMemorySize,
druid.processing.buffer.sizeBytes, or druid.processing.numThreads:
maxDirectMemory[3,506,438,144], memoryNeeded[4,294,967,296] =
druid.processing.buffer.sizeBytes[1,073,741,824] * ( druid.processing.numThreads[3] + 1 )
Resolution:
-XX:MaxDirectMemorySize = druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers +
druid.processing.numThreads + 1)
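Worked example using the values from the exception above and the default of two merge buffers: 1,073,741,824 * (2 + 3 + 1) = 6,442,450,944 bytes, so a setting such as -XX:MaxDirectMemorySize=6g satisfies the requirement.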
Property: druid.processing.buffer.sizeBytes
Default: 1073741824 (1 GB)
Description: Specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and MM nodes uses a scratch buffer of this size to do all of its intermediate computations off-heap. Larger values allow more aggregations in a single pass over the data, while smaller values can require more passes depending on the query being executed.
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Important Configurations
Property: druid.processing.numMergeBuffers
Default: max(2, druid.processing.numThreads / 4)
Description: The number of direct memory buffers available for merging query results. The buffers are sized by druid.processing.buffer.sizeBytes. This property is effectively a concurrency limit for queries that require merging buffers. If you are using any queries that require merge buffers (currently, just groupBy v2), you should have at least two of these.
Property: druid.processing.numThreads
Default: Number of cores - 1 (or 1)
Description: The number of processing threads available for parallel processing of segments. The rule of thumb is num_cores - 1, which means that even under heavy load there will still be one core available for background tasks like talking to ZooKeeper and pulling down segments. If only one core is available, this property defaults to 1.
Property: druid.broker.http.numConnections
Default: 20
Description: Size of the connection pool for the Broker to connect to Historical and real-time processes. If more queries than this number all need to speak to the same node, they will queue up.
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Use Druid From Hortonworks?
Capability (With HDP / Druid Alone):
Interactive Analytics: ✓ / ✓
Analyze Data Streams: ✓ / ✓
Spatial Analytics: ✓ / ✓
Horizontally Scalable: ✓ / ✓
SQL:2011 Interface: ✓ / ✖
Join Historical and Real-time Data: ✓ / ✖
Management and Monitoring with Ambari: ✓ / ✖
Managed Rolling Upgrades: ✓ / ✖
Visualization with Superset: ✓ / ✖
Easy App Development with Hortonworks SAM: ✓ / ✖
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid’s Role in Scalable Data Warehousing
[Architecture diagram: real-time feeds (Kafka, Storm, etc.) and S3/HDFS feed a core platform of Hive and Druid OLAP indexes; a unified SQL and MDX layer (HiveServer2 for Hive SQL and MDX, a Thrift server for SparkSQL) serves SQL BI tools and MDX tools; the UI layer offers Superset for fast exploration plus a builder UI; Ambari, Ranger, Atlas, and SmartSense provide management.]
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
References
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
https://github.com/apache/incubator-druid/tree/master/docs/content/comparisons
http://druid.io
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You!