Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks sub-optimal choices to power interactive applications. Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we discuss using Druid for analytics and why the architecture is well suited to power analytic applications.
User-facing applications are replacing traditional reporting interfaces as the preferred means for organizations to derive value from their datasets. To provide an interactive user experience, queries issued by analytic applications must complete on the order of milliseconds. To meet these needs, organizations often struggle with selecting a proper serving layer. Many serving layers are selected because of their general popularity, without understanding their possible architectural limitations.
Druid is an analytics data store designed for analytic (OLAP) queries on event data. It draws inspiration from Google’s Dremel, Google’s PowerDrill, and search infrastructure. Many enterprises are switching to Druid for analytics, and we will cover why the technology is a good fit for its intended use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
Motivation
Druid introduction and use case
Demo
Druid Architecture
Storage Internals
Recent Improvements
Initial Use Case
Druid was originally built to power the ad-tech analytics product at Metamarkets: a dashboard, similar to the one shown in the picture on the right, where you can visualize time-series data and do arbitrary filtering and grouping on any combination of dimensions.
Requirements
Arbitrary queries – users should be able to filter and group on any combination of dimensions.
Scalability – should be able to handle trillions of events/day.
Interactive – since the data store was going to power an interactive dashboard, low-latency queries were a must.
Real-time – the time between when an event occurs and when it is visible on the dashboard should be minimal (on the order of a few seconds).
High Availability – no central point of failure.
Rolling Upgrades – the architecture was required to support rolling upgrades.
Motivation
Interactive, real-time visualizations on complex data streams
Answer BI questions
How many unique male visitors visited my website last month?
How many products were sold last quarter, broken down by demographic and product category?
We are not interested in dumping the entire dataset.
Suppose I am running an ad campaign, and I want to understand:
What kind of impressions are there?
What is my click-through rate?
How many users decided to purchase my services?
We may have a user activity stream and want to know how users are behaving.
We may have a stream of firewall events and want to detect anomalies in those streams in real time.
Also, for very large distributed clusters, there is a need to answer questions about application performance.
How is each individual node in my cluster behaving?
Are there any anomalies in query response times?
All of the above use cases involve data streams that can be huge in volume, depending on the scale of the business.
How do I analyze this information?
How do I get insights from these streams of events in real time?
What is Druid?
Column-oriented distributed datastore – data is stored in a columnar format. Many datasets have a large number of dimensions (hundreds or even thousands), but most queries only need 5-10 columns; the column-oriented format lets Druid scan only the required columns.
Sub-second query times – Druid uses techniques like bitmap indexes for fast filtering, memory-mapped files to serve data from memory, data summarization and compression, and query caching, and it has highly optimized algorithms for each query type, which together achieve sub-second query times.
Realtime streaming ingestion from almost any ETL pipeline.
Arbitrary slicing and dicing of data – no need to create pre-canned drill downs
Automatic data summarization – during ingestion Druid can summarize your data; e.g., if a dashboard only shows events aggregated by HOUR, Druid can optionally be configured to pre-aggregate at ingestion time.
Approximate algorithms (HyperLogLog, theta sketches) – for fast approximate answers; see the sketch example after this list.
Scalable to petabytes of data
Highly available
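To make the approximate-algorithms bullet concrete, here is a minimal, self-contained sketch of approximate distinct counting with a HyperLogLog sketch. It uses the Apache DataSketches library as a stand-in; it illustrates the technique, not Druid's internal aggregator code, and the numbers are made up.

    import org.apache.datasketches.hll.HllSketch;

    public class HllExample {
        public static void main(String[] args) {
            HllSketch sketch = new HllSketch(12);        // lgK = 12 -> roughly 1-2% error
            for (int i = 0; i < 1_000_000; i++) {
                sketch.update("user-" + (i % 250_000));  // 250,000 distinct users, many repeats
            }
            // The estimate lands close to the true 250,000 while using only a few KB of memory.
            System.out.printf("approx uniques = %.0f%n", sketch.getEstimate());
        }
    }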
This shows some of the production users.
I can talk about some of the large ones which have common use cases.
Alibaba and eBay use Druid for e-commerce and user-behavior analytics.
Cisco has a real-time analytics product for analyzing network flows.
Yahoo uses Druid for user-behavior analytics and real-time cluster monitoring.
Hulu does interactive analysis of user and application behavior.
PayPal and SK Telecom use Druid for business analytics.
Realtime Nodes -
Handle real-time ingestion; support both pull- and push-based ingestion.
Store data in a row-oriented, write-optimized structure.
Periodically convert the write-optimized structure to a read-optimized structure.
Ability to serve queries as soon as data is ingested.
Historical Nodes -
Main workhorses of the Druid cluster
Use memory-mapped files to load columnar data
Respond to User queries
Broker Nodes -
Keep track of which node is serving which portion of the data
Ability to scatter queries across multiple historical and realtime nodes
Caching Layer
Druid has the concept of different node types, where each node type is designed and optimized to perform a specific set of tasks.
Realtime Index Tasks / Realtime Nodes-
Handle real-time ingestion; support both pull- and push-based ingestion.
Handle queries – ability to serve queries as soon as data is ingested.
Store data in a write-optimized data structure on heap; periodically convert it to read-optimized, time-partitioned immutable segments and persist them to deep storage.
In case you need to do any ETL, like data enrichment or joining multiple streams of data, you can do it in a separate ETL pipeline and send the massaged data to Druid.
Deep storage can be any distributed filesystem and acts as a permanent backup of the data.
Historical Nodes -
Main workhorses of the Druid cluster
Use memory-mapped files to load immutable segments
Respond to User queries
Now let's see how data can be queried.
Broker Nodes -
Keep track of the data chunks loaded by each node in the cluster
Ability to scatter queries across multiple historical and realtime nodes
Caching Layer
Now let's discuss another case: you do not have streaming data, but want to ingest batch data into Druid.
Batch ingestion can be done using either a Hadoop MR or a Spark job, which converts your data into time-partitioned segments and persists them to deep storage.
With many historical nodes in a cluster, the load needs to be balanced across them; this is done by the Coordinator Nodes -
Use ZooKeeper for coordination
Ask historical nodes to load or drop data
Move data across historical nodes to balance load in the cluster
Manage data replication
External Dependencies –
Metadata Storage – stores metadata about the segments, i.e., the location of segments, information on how to load them, etc.
Memcached/Redis cache – you can optionally add a Memcached or Redis cache, which can be used to cache partial query results.
Druid: Segments
Data in Druid is stored in Segment Files.
Partitioned by time
Ideally, each segment file is smaller than 1 GB.
If files are larger, use smaller time partitions.
Example Wikipedia Edit Dataset
Data Rollup
Rollup by hour
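As an illustration (the rows and numbers below are hypothetical, not taken from the actual dataset), rollup by hour collapses individual edit events into one stored row per hour and dimension combination, with pre-aggregated metrics:

    raw events:
      2011-01-01T01:05:00Z  page=Justin Bieber  language=en  added=10
      2011-01-01T01:18:00Z  page=Justin Bieber  language=en  added=25
      2011-01-01T01:43:00Z  page=Justin Bieber  language=en  added=15

    rolled up by hour:
      2011-01-01T01:00:00Z  page=Justin Bieber  language=en  count=3  added=50

Three raw rows become one stored row, which is why ingestion-time summarization is such a large storage and scan-time win.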
Dictionary Encoding
Create and store IDs for each value
e.g. page column
Values - Justin Bieber, Ke$ha, Selena Gomez
Encoding - Justin Bieber: 0, Ke$ha: 1, Selena Gomez: 2
Column Data - [0 0 0 1 1 2]
city column - [0 0 0 1 1 1]
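A minimal sketch of the same idea in code (illustrative only, not Druid's actual column-writer classes): assign each distinct string the next integer ID and store the column as an int array.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DictionaryEncodeExample {
        public static void main(String[] args) {
            String[] page = {"Justin Bieber", "Justin Bieber", "Justin Bieber",
                             "Ke$ha", "Ke$ha", "Selena Gomez"};
            Map<String, Integer> dictionary = new LinkedHashMap<>();
            int[] encoded = new int[page.length];
            for (int row = 0; row < page.length; row++) {
                Integer id = dictionary.get(page[row]);
                if (id == null) {                  // first time we see this value,
                    id = dictionary.size();        // give it the next available ID
                    dictionary.put(page[row], id);
                }
                encoded[row] = id;
            }
            System.out.println(dictionary);               // {Justin Bieber=0, Ke$ha=1, Selena Gomez=2}
            System.out.println(Arrays.toString(encoded)); // [0, 0, 0, 1, 1, 2]
        }
    }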
Bitmap Indices
Store Bitmap Indices for each value
Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
Selena Gomez -> [5] -> [0 0 0 0 0 1]
Queries
Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
Indexes compressed with Concise or Roaring encoding
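Since the indexes are Concise- or Roaring-encoded, the same boolean filtering can be sketched with the open-source RoaringBitmap library (an illustration of the technique, not Druid's own index classes):

    import org.roaringbitmap.RoaringBitmap;

    public class BitmapFilterExample {
        public static void main(String[] args) {
            RoaringBitmap justinBieber = RoaringBitmap.bitmapOf(0, 1, 2);          // page = Justin Bieber
            RoaringBitmap kesha        = RoaringBitmap.bitmapOf(3, 4);             // page = Ke$ha
            RoaringBitmap languageEn   = RoaringBitmap.bitmapOf(0, 1, 2, 3, 4, 5); // language = en
            RoaringBitmap countryCa    = RoaringBitmap.bitmapOf(3, 4, 5);          // country = CA

            // page = Justin Bieber OR page = Ke$ha -> rows {0,1,2,3,4}
            System.out.println(RoaringBitmap.or(justinBieber, kesha));
            // language = en AND country = CA -> rows {3,4,5}
            System.out.println(RoaringBitmap.and(languageEn, countryCa));
        }
    }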
The Indexing Service is a highly available, distributed service that runs indexing-related tasks.
The indexing service is composed of three main components:
Overlord - responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers.
Middle Managers - worker nodes that execute submitted tasks; they launch peons that actually run the tasks.
Peons – managed by middle managers; each peon runs a single task. A peon receives a task definition, a JSON spec file that describes the task to perform.
All the coordination and communication for task assignment and announcing task statuses is done via ZooKeeper.
Streaming Ingestion
Done by Realtime Index Tasks
Ability to ingest streams of data
Stores data in a write-optimized structure – a row-oriented key-value store indexed by time and dimension values
Periodically, based on either a time interval or a threshold on the number of rows, converts the write-optimized structure to read-optimized segments
Events are queryable as soon as they are ingested
Both push and pull based ingestion
Tranquility is a helper library for Druid that provides easy coordination and task management for streaming ingestion into Druid.
It has a very simple API which you can use to send events to druid.
On the right-hand side you can see a simple example of sending an event to Druid. You just create a Tranquilizer with a config;
the config contains the location of the Druid overlord, the name of your datasource, and other ingestion-related properties.
Simply call send on the Tranquilizer; it automatically takes care of creating a Druid task, managing the task's lifecycle, discovering the task's location, and sending data to that task.
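Since the slide's code is not reproduced in these notes, here is a rough sketch of what that looks like with Tranquility's Java API. The class names follow the Tranquility documentation; the config file name, datasource name, and event fields are placeholders.

    import com.metamx.tranquility.config.DataSourceConfig;
    import com.metamx.tranquility.config.PropertiesBasedConfig;
    import com.metamx.tranquility.config.TranquilityConfig;
    import com.metamx.tranquility.druid.DruidBeams;
    import com.metamx.tranquility.tranquilizer.Tranquilizer;

    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;

    public class TranquilitySendExample {
        public static void main(String[] args) throws Exception {
            // server.json holds the overlord location, datasource name, and other ingestion properties
            InputStream configStream =
                    TranquilitySendExample.class.getClassLoader().getResourceAsStream("server.json");
            TranquilityConfig<PropertiesBasedConfig> config = TranquilityConfig.read(configStream);
            DataSourceConfig<PropertiesBasedConfig> wikipedia = config.getDataSource("wikipedia");

            Tranquilizer<Map<String, Object>> sender =
                    DruidBeams.fromConfig(wikipedia).buildTranquilizer(wikipedia.tranquilizerBuilder());
            sender.start();

            Map<String, Object> event = new HashMap<>();
            event.put("timestamp", "2017-06-13T12:00:00Z");
            event.put("page", "Justin Bieber");
            event.put("added", 30);
            sender.send(event);   // Tranquility creates the Druid task, tracks it, and routes the event

            sender.flush();       // wait for in-flight events to be sent
            sender.stop();
        }
    }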
We have also added experimental support for ingesting data from Kafka, which supports exactly-once consumption of data.
How this works is as follows –
Each message written to Kafka is placed into an ordered and immutable sequence called a partition and is assigned a sequentially incrementing identifier called an offset.
Messages are pulled by Druid tasks, which verify the offsets to make sure the sequence is unbroken.
Then, at the time of persisting the data, both the segments and the corresponding Kafka offset information are persisted in a single transaction.
Since the offsets are stored in the metadata, in case of failure we can simply start reading from those offsets again.
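A conceptual sketch of that single-transaction commit (plain JDBC pseudocode; the table and column names are made up, and this is not Druid's actual metadata code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class ExactlyOnceCommitSketch {
        // Commit the new segment and the Kafka offsets it covers atomically, so that after a
        // failure ingestion can resume from the last committed offsets without duplicates.
        public static void publish(String segmentId, String payload,
                                   int partition, long endOffset) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:mysql://metadata-store/druid")) {
                db.setAutoCommit(false);
                try (PreparedStatement insertSegment = db.prepareStatement(
                             "INSERT INTO segments (id, payload) VALUES (?, ?)");
                     PreparedStatement saveOffset = db.prepareStatement(
                             "UPDATE kafka_offsets SET end_offset = ? WHERE partition_id = ?")) {
                    insertSegment.setString(1, segmentId);
                    insertSegment.setString(2, payload);
                    insertSegment.executeUpdate();
                    saveOffset.setLong(1, endOffset);
                    saveOffset.setInt(2, partition);
                    saveOffset.executeUpdate();
                    db.commit();       // segment metadata and offsets become visible together
                } catch (Exception e) {
                    db.rollback();     // neither the segment nor the offsets are recorded
                    throw e;
                }
            }
        }
    }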
Batch Ingestion – ingest data in batches.
HadoopIndexTask
Peon launches Hadoop MR job
Mappers read data
Reducers create Druid segment files
Index Task
Suitable for smaller data sizes (<1 GB)
Druid broker nodes expose HTTP endpoints where users can POST queries.
Queries and results expressed in JSON
Multiple Query Types
On the right we have an example of a groupBy query in JSON.
In the JSON query you can specify the datasource, the granularity (the time bucket for your data), any filter you may want to use,
the list of aggregations that you need to perform, and any post-aggregations such as averages.
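Since that slide's JSON is not reproduced here, a rough sketch of posting such a groupBy query to a broker over HTTP follows (Java 11 HttpClient; the broker host/port, datasource, and field names are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class GroupByQueryExample {
        public static void main(String[] args) throws Exception {
            // A native Druid groupBy query expressed in JSON.
            String query = "{\n" +
                    "  \"queryType\": \"groupBy\",\n" +
                    "  \"dataSource\": \"wikipedia\",\n" +
                    "  \"granularity\": \"hour\",\n" +
                    "  \"dimensions\": [\"page\"],\n" +
                    "  \"filter\": { \"type\": \"selector\", \"dimension\": \"language\", \"value\": \"en\" },\n" +
                    "  \"aggregations\": [ { \"type\": \"longSum\", \"name\": \"edits\", \"fieldName\": \"count\" } ],\n" +
                    "  \"intervals\": [\"2017-06-01/2017-06-02\"]\n" +
                    "}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://broker-host:8082/druid/v2/?pretty"))  // broker query endpoint
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();

            HttpResponse<String> response =
                    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // results come back as JSON
        }
    }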
The second and easier way to query Druid is using SQL (support for built-in SQL is experimental at present).
We leverage Apache Calcite for parsing and planning the queries.
It also uses Avatica, which is a framework for building JDBC drivers for databases.
So using this, you can connect any BI tool that supports JDBC to Druid.
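For example, connecting over JDBC via the Avatica driver might look like the following sketch (the broker address is a placeholder; the connection URL format follows the Druid SQL documentation):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DruidSqlJdbcExample {
        public static void main(String[] args) throws Exception {
            // Avatica remote JDBC URL pointing at a Druid broker's SQL endpoint
            String url = "jdbc:avatica:remote:url=http://broker-host:8082/druid/v2/sql/avatica/";
            try (Connection connection = DriverManager.getConnection(url);
                 Statement statement = connection.createStatement();
                 ResultSet rs = statement.executeQuery(
                         "SELECT page, SUM(\"count\") AS edits " +
                         "FROM wikipedia GROUP BY page ORDER BY edits DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + " -> " + rs.getLong("edits"));
                }
            }
        }
    }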
Druid also defines some new operators to support approximate queries, e.g., for retention analysis.
Most events per day: 300 billion events/day (Metamarkets)
Most computed metrics: 1 billion metrics/min (Jolata)
Largest cluster: 200 nodes (Metamarkets)
Largest hourly ingestion: 2 TB per hour (Netflix)
Query performance – query time, segment scan time …
Ingestion Rate – events ingested, events persisted …
JVM Health – JVM Heap usage, GC stats …
Cache Related – cache hits, cache misses, cache evictions …
System related – CPU, disk, network, swap usage, etc.
No Downtime
Data redundancy
Rolling upgrades
You can secure Druid nodes using Kerberos and use the SPNEGO mechanism to interact with Druid HTTP endpoints.
Summary
It is easy to install and manage Druid via Ambari.
Real-time, with ingestion and query latencies on the order of a few seconds.
Arbitrary slicing and dicing of data
Druid uses Guice, which is a lightweight dependency injection framework.