Interactive real-time dashboards on data streams using Kafka, Druid, and Superset - DataWorks Summit
When interacting with analytics dashboards, a smooth user experience hinges on two key requirements: quick response times and data freshness. To meet these requirements for fast, interactive BI dashboards over streaming data, organizations often struggle to select a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding layer and integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, walk through its key features, and review performance characteristics from real-world use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
Real Time analytics with Druid, Apache Spark and Kafka - Daria Litvinov
The presentation from the Druid meetup in Tel Aviv, November 2019.
Presenting the architecture we've built at Outbrain for a real-time analytics dashboard based on Druid, Spark Streaming, and Kafka.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail.” When developing systems, the adage can be taken a step further: “If you fail to plan FOR FAILURE, you plan to fail.” At Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, and on how to determine which failure-mode semantics are important for a real-time event processing system.
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine - StampedeCon
At the StampedeCon 2015 Big Data Conference: This talk will examine the benefits of using multiple persistence strategies to build an end-to-end predictive engine. Utilizing Spark Streaming backed by a Cassandra persistence layer allows rapid lookups and inserts to be made in order to perform real-time model scoring. Spark backed by Parquet files, stored in HDFS, allows for high-throughput model training and tuning utilizing Spark MLlib. Both of these persistence layers also provide ad-hoc queries via Spark SQL in order to easily analyze model sensitivity and accuracy. Storing the data in this way also provides extensibility to leverage existing tools, like CQL to perform operational queries on the data stored in Cassandra and Impala to perform larger analytical queries on the data stored in HDFS, further maximizing the benefits of the flexible architecture.
GDPR compliance application architecture and implementation using Hadoop and Streaming - DataWorks Summit
The General Data Protection Regulation (GDPR) is legislation designed to protect the personal data of European Union citizens and residents. The main requirement is to log personal data accesses/changes in customer-specific applications. These logs can then be audited by the owning entities to provide reporting to end users indicating usage of their personal data. Users have the "right to be forgotten," meaning their personal data can be purged from the system at their request. The regulation goes into effect on May 25, 2018, with significant fines for non-compliance.
This session will provide insight on how to approach and implement a GDPR compliance solution using Hadoop and Streaming for any enterprise with heavy volumes of data. This session will delve into deployment strategies, the architecture of choice (Kafka, NiFi, and Hive ACID with streaming), implementation best practices, configurations, and security requirements. Hortonworks Professional Services System Architects helped the customer on the ground to design, implement, and deploy this application in production.
Speaker
Saurabh Mishra, Hortonworks, Systems Architect
Arun Thangamani, Hortonworks, Systems Architect
Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all.
Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years of architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (Hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Securing data in hybrid environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. In this talk, we will discuss how companies can use tag-based policies in Apache Ranger to protect access to data both in on-premises environments and in AWS-based cloud environments. We will go into the details of how tag-based policies work and their integration with Apache Atlas and various services. We will also talk through how companies can leverage Ranger's policies to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Kafka, Apache Hive, Apache Spark, or plain old ETL using MapReduce. We will also deep dive into Ranger's proposed integration with S3 and other cloud-native systems. We will wrap up with an end-to-end demo showing how tags and tag-based masking policies can be used to anonymize sensitive data, how tags are propagated within the system, and how sensitive data can be protected using tag-based policies.
Speakers
Don Bosco Durai, Chief Security Architect, Privacera
Madhan Neethiraj, Sr. Director of Engineering, Hortonworks
Spectator to Participant: Contributing to Cassandra (Patrick McFadin, DataStax) - DataStax
Feeling the need to contribute something to Apache Cassandra? Maybe you want to help guide the future of your favorite database? Get off the sidelines and get in the game! It's easy to say, but how do you even get started? I will outline some of the ways you can help contribute to Apache Cassandra, from minor to major. If you don't have the time or ability to submit code, there are a lot of ways you can participate. What if you do want to write some code? I can walk you through the process of creating a patch and submitting it for final approval. Got a great idea? I'll show you how to propose it to the community at large. Take it from me, participating is so much more fun than just watching the project from a distance. Time to jump in!
About the Speaker
Patrick McFadin, Chief Evangelist, DataStax
Patrick McFadin is one of the leading experts on Apache Cassandra and data modeling techniques. As the Chief Evangelist for Apache Cassandra and a consultant for DataStax, he has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was Chief Architect at Hobsons and an Oracle DBA/Developer for over 15 years.
Big data security challenges are a bit different from those of traditional client-server applications: big data systems are distributed in nature, introducing unique security vulnerabilities. The Cloud Security Alliance (CSA) has categorized the security and privacy challenges into four different aspects of the big data ecosystem: infrastructure security, data privacy, data management, and integrity and reactive security. Each of these aspects is further divided into the following security challenges:
1. Infrastructure security
a. Secure distributed processing of data
b. Security best practices for non-relational data stores
2. Data privacy
a. Privacy-preserving analytics
b. Cryptographic technologies for big data
c. Granular access control
3. Data management
a. Secure data storage and transaction logs
b. Granular audits
c. Data provenance
4. Integrity and reactive security
a. Endpoint input validation/filtering
b. Real-time security/compliance monitoring
In this talk, we are going to refer to the above classification and identify existing security controls, best practices, and guidelines. We will also paint a big picture of how the collective usage of all the discussed security controls (Kerberos, TDE, LDAP, SSO, SSL/TLS, Apache Knox, Apache Ranger, Apache Atlas, Ambari Infra, etc.) can address fundamental security and privacy challenges across the entire Hadoop ecosystem. We will also briefly discuss recent security incidents involving Hadoop systems.
Speakers
Krishna Pandey, Staff Software Engineer, Hortonworks
Kunal Rajguru, Premier Support Engineer, Hortonworks
Why data warehouses cannot support hot analytics - Imply
Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics
Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics.
Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcome these limitations, and how companies implement a “temperature-based” approach to analytics.
JupyterCon 2020 - Supercharging SQL Users with Jupyter Notebooks - Michelle Ufford
In this talk, we’ll share a practical action plan for how Jupyter notebooks can significantly uplevel the data experience for your SQL users. We’ll do this by introducing a three-tier action plan that describes how companies such as Netflix have successfully created a well-integrated SQL experience within Jupyter.
This action plan will cover how to:
1. build a strong foundation that makes SQL more accessible for your users
2. increase productivity with a secure & integrated user experience; and
3. customize & extend SQL support to meet the unique needs of your organization.
We’ll first look at why notebooks are so appealing for SQL users and explore some of the traditional challenges they often face when working with notebooks. We’ll describe how JupyterHub can be leveraged as a solid foundation for you to build upon. We’ll then describe how adding SQL magics & popular libraries to your JupyterHub environment can lead to a dramatically better experience for your users.
Next, we’ll discuss how a secure & tightly integrated environment can lead to increased productivity for your users. We’ll explore how some easy configuration changes can have an enormous impact on your users. We’ll also offer some ideas for integrating Jupyter with the rest of your environment.
We’ll then move onto customizing and extending Notebooks using data magics, extensions, and tools. We’ll discuss how you can leverage libraries such as ipython-sql and sparkmagic to improve SQL support. We’ll conclude with some ideas for customizing your Jupyter environment to meet the unique needs of your organization.
Attendees will walk away from this session with best practices, tips and suggestions, links to useful resources, and an actionable plan they can implement in their own organizations.
At the StampedeCon 2015 Big Data Conference: As a frequent recipient of the J.D. Power award for excellence in customer service, T-Mobile takes great pride in the quality of care that we provide our customers. As smartphone technologies advance (and fragment), the challenge of providing quality technical support can be daunting.
To address this challenge, T-Mobile is reinventing many of its traditional practices and embracing DevOps, cloud deployment and lambda architecture. Specifically:
* Cassandra for fast and consistent writes (at scale), as well as low-latency reads
* Apache Spark and EMR for processing data archived in S3
* Kafka for flexibility in data ingestion
* Chef and CloudFormation Templates to automate deployments
* Graphite and Riemann for monitoring
The goals of this presentation are to:
* showcase how these technologies are helping T-Mobile be successful in addressing these business challenges
* share tactics for tackling customer preference management and data collection transparency
* share specific “lessons learned” while migrating to NoSQL, Big Data and The Cloud
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - Codemotion
The talk presents a new technique for real-time single-entity information extraction and investigation. The technique eliminates the regular refresh and persistence of data within the search engine (ETL), providing real-time access to source data and improving response times using in-memory data techniques. The solution presented is concrete, with live customers, and based on real business needs. I will explain the architectural overview, the technology stack used (based on the Apache Lucene library), the accomplished results, and how to scale out the solution.
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ... - DataWorks Summit
For over 30 years, Parametric has been a leading provider of model-based portfolios to institutional and private investors, with unique implementation and customization expertise. Much like other cutting-edge financial services providers, Parametric operates with highly diverse, fast-moving data from which it gleans insights. Data sources range from benchmark providers to electronic trading participants to stock exchanges. The challenge is not just to onboard the data but also to figure out how to monetize it when the schemas are fast-changing. This presents a problem for traditional architectures, where large teams are needed to design the new ETL flow. Organizations that are able to quickly adapt to new schemas and data sources have a distinct competitive advantage.
In this presentation and demo, architects from Parametric, Chris Gambino & Vamsi Chemitiganti, will present the data architecture designed in response to this business challenge. We discuss the approach (and trade-offs) to pooling, managing, and processing the data using the latest techniques in data ingestion & pre-processing. Overall best practices in creating a central data pool are also discussed. The goal is for quantitative analysts to have the most accurate and up-to-date information for their models to work on. Attendees will be able to draw on their experiences, from both a business and a technology standpoint, on not just creating a centralized data platform but also being able to distribute it to different units.
Analytics methods for big data have two requirements above and beyond analytics methods for normal-sized data. First, the analytics cannot assume that all the data will fit in memory, or even fit on one server. Second, the choice of analysis methods must avoid high-order algorithms. We illustrate the point with one algorithm: Locality Sensitive Hashing.
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution, and guaranteed low latencies (with SLAs) - all in a managed document store that you can query using SQL and JavaScript. We also review common scenarios and advanced data science scenarios.
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability, and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid's indexing and querying capabilities using Apache Hive. In particular, our solution allows indexing complex query results in Druid using Hive, querying Druid data sources from Hive using SQL, and executing complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
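As a concrete illustration of this integration, once the Druid storage handler is available, an existing Druid data source can be exposed to Hive and queried in SQL; Calcite then translates the SQL into Druid JSON queries. A minimal sketch, assuming a HiveServer2 at localhost, a Druid Broker at localhost:8082, and a Druid data source named "pageviews" with url and latency columns (all assumptions, not details from the talk):

# Sketch: register an existing Druid data source as a Hive external table
# and query it in SQL. All names and addresses below are assumptions.
beeline -u jdbc:hive2://localhost:10000 -e "
SET hive.druid.broker.address.default=localhost:8082;
CREATE EXTERNAL TABLE druid_pageviews
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ('druid.datasource' = 'pageviews');
SELECT url, SUM(latency) AS total_latency FROM druid_pageviews GROUP BY url;"

Under the hood, the GROUP BY above would be rewritten by Calcite into a Druid groupBy JSON query and executed on the Druid cluster, rather than as a MapReduce/Tez job.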
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production - Codemotion
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Taboola's experience with Apache Spark (presentation @ Reversim 2014) - tsliwowicz
At Taboola we are getting a constant feed of data (many billions of user events a day) and are using Apache Spark together with Cassandra for both real-time data stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source project - a Hadoop-compatible computing engine that makes big data analysis drastically faster, through in-memory computing, and simpler to write, through easy APIs in Java, Scala and Python. This project was born as part of PhD work in UC Berkeley's AMPLab (part of the BDAS - pronounced "Bad Ass" - stack) and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, on Apache Mesos with ZooKeeper, or on YARN, and can run side by side with Hadoop/Hive on the same data.
One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.
Measuring CDN performance and why you're doing it wrong - Fastly
Integrating content delivery networks into your application infrastructure can offer many benefits, including major performance improvements for your applications. So understanding how CDNs perform — especially for your specific use cases — is vital. However, testing for measurement is complicated and nuanced, and results in metric overload and confusion. It's becoming increasingly important to understand measurement techniques, what they're telling you, and how to apply them to your actual content.
In this session, we'll examine the challenges around measuring CDN performance and focus on the different methods for measurement. We'll discuss what to measure, important metrics to focus on, and different ways that numbers may mislead you.
More specifically, we'll cover:
Different techniques for measuring CDN performance
Differentiating between network footprint and object delivery performance
Choosing the right content to test
Core metrics to focus on and how each impacts real traffic
Understanding cache hit ratio, why it can be misleading, and how to measure for it
AWS Summit 2013 | Singapore - Delivering Search for Today's Local, Social, an... - Amazon Web Services
As more organizations seek to leverage the power and benefits of the cloud, they also need to combine new systems with existing on-premises systems. Services such as Virtual Private Cloud, VPN and Direct Connect enable AWS customers to combine on-premises and cloud-based resources easily and effectively. This session will walk customers through the 4 main patterns of connectivity and will include a "real time" demonstration of how easy it is to set up your own VPC and start working in your own private section of the AWS Cloud.
Machine Learning for Smarter Apps - Jacksonville Meetup - Sri Ambati
Machine Learning for Smarter Apps with Tom Kraljevic
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C... - Imply
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
Druid and Hive Together: Use Cases and Best Practices - DataWorks Summit
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. On the other hand, Druid excels at low-latency, interactive queries over streaming data and at making data available in real time for queries. Although the high-level messaging presented by both projects may lead you to believe they are competing for the same use case, the technologies are in fact extremely complementary solutions.
By combining the rich query capabilities of Hive with the powerful real-time streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low-latency real-time streaming analytics solutions. In this talk we will discuss the motivation for combining Hive and Druid, along with the benefits, use cases, best practices, and benchmark numbers.
The agenda of the talk:
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e., those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
2. Druid Concepts
• What is it?
• Druid is an open-source, fast, distributed, column-oriented data store, designed for low-latency ingestion and very fast ad-hoc, aggregation-based analytics.
• Pros:
• Fast response for aggregation operations (almost sub-second)
• Supports real-time streaming ingestion from many other popular solutions in the market, e.g. Kafka, Samza, Spark, etc.
• Traditional batch-type ingestion (Hadoop based)
• Cons / Limitations:
• Joins are not mature enough
• Limited options compared to other SQL-like solutions
3. Brief History on Druid
• History
• Druid was started in 2011 to power analytics at Metamarkets. The project was open-sourced in October 2012 and moved to an Apache License in February 2015.
4. Industries has Druid in production
• Metamarkets
• Druid is the primary data store for Metamarkets’ full stack visual analytics service for the RTB (real time bidding) space. Ingesting over
30 billion events per day, Metamarkets is able to provide insight to its customers using complex ad-hoc queries at query time of
around 1 second in almost 95% of the time.
• Airbnb
• Druid powers slice and dice analytics on both historical and real time-time metrics. It significantly reduces latency of analytic queries
and help people to get insights more interactively.
• Alibaba
• At Alibaba Search Group, we use Druid for real-time analytics of users' interaction with its popular e-commerce site.
• Cisco
• Cisco uses Druid to power a real-time analytics platform for network flow data.
• eBay
• eBay uses Druid to aggregate multiple data streams for real-time user behavior analytics by ingesting up at a very high rate(over
100,000 events/sec), with the ability to query or aggregate data by any random combination of dimensions, and support over 100
concurrent queries without impacting ingest rate and query latencies.
6. Druid In Production – Metamarkets
• 3M+ events/sec through Druid's real-time ingestion.
• 100+ PB of data.
• Applications supporting 1000s of concurrent queries per second.
• Scales horizontally across 1000s of cores.
• References:
• https://metamarkets.com/2016/impact-on-query-speed-from-forced-processing-ordering-in-druid/
• https://metamarkets.com/2016/distributing-data-in-druid-at-petabyte-scale/
7. A real example of Druid in Action
Reference: https://whynosql.com/2015/11/06/lambda-architecture-with-druid-at-gumgum/
8. Ideal requirements for Druid?
• You need:
• Fast aggregation & arbitrary data exploration at low latency on huge data sets
• Fast response on near-real-time event data (ingested data is immediately available for querying)
• No SPoF (single point of failure)
• Handling of petabytes of data with multiple dimensions
• Sub-second, time-oriented summarization of the incoming data stream
• NOTE: before we get to the architecture, here is a typical use case to illustrate what we have said so far.
9. Druid Concepts – An example
• The Data:

timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:53Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53

• Roll-up at ingestion time:

GROUP BY timestamp, publisher, advertiser, gender, country
  :: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)

timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
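This roll-up is configured in Druid's ingestion spec rather than written as SQL: the dimensions, the metric aggregators, and the query granularity together express the GROUP BY above. A minimal sketch of the corresponding dataSchema, reusing the column names from the example (the data source and file names are assumptions, not from the deck):

# Sketch: dataSchema fragment expressing the roll-up above. "ad_events" and
# the output file name are assumptions; column names follow the example.
cat > rollup-dataSchema.json <<'EOF'
{
  "dataSource": "ad_events",
  "parser": {
    "type": "string",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["publisher", "advertiser", "gender", "country"] }
    }
  },
  "metricsSpec": [
    { "type": "count",     "name": "impressions" },
    { "type": "longSum",   "name": "clicks",  "fieldName": "click" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "price" }
  ],
  "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "HOUR" }
}
EOF

With queryGranularity set to HOUR, events in the same hour with identical dimension values collapse into a single stored row, which is why the rolled-up table has one row per hour per publisher.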
10. Druid – Architecture
[Architecture diagram: CLIENT queries arrive at the Broker Node; streaming data is ingested and indexed by the Real-Time Node; static data is submitted as indexing tasks via the Overlord Node; Historical Nodes serve immutable segments loaded from Deep Storage (HDFS).]
11. Druid – Architecture (cluster management dependency)
[Architecture diagram: the same data and query flow as above, with a Coordinator Node managing segment assignment to Historical Nodes, plus the external dependencies ZooKeeper and the Metadata Store; Deep Storage remains HDFS.]
12. Druid – Components
• Broker Node
• Real time node
• Overlord Node
• Middle-Manager Node
• Historical Node
• Coordinator Node
• Aside from these nodes, there are 3 external dependencies to the system:
• A running ZooKeeper cluster for cluster service discovery and maintenance of the current data topology
• A metadata storage instance for maintenance of metadata about the data segments that should be served by the system
• A "deep storage" system to hold the stored segments.
13. Druid - Data Storage Layer
• Segments and Data Storage
• Druid stores its index in segment files, which are partitioned by time
• Columnar: the data for each column is laid out in separate data structures.
14. Druid – Query
• Query types:
• Timeseries
• TopN
• GroupBy & Aggregations
• Time Boundary
• Search
• Select
• Each query is POSTed as a JSON object whose main components are (see the sketch below):
• a) queryType
• b) granularity
• c) filter
• d) aggregation
• e) post-aggregation
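To make these components concrete, here is a minimal sketch of a timeseries query as it would be POSTed to a Broker; the data source and field names are assumptions for illustration (they anticipate the pageviewsLat examples on the following slides):

# Sketch: daily total latency for user "alice" (all names are assumptions).
curl -X POST -H 'Content-Type: application/json' http://localhost:8082/druid/v2/?pretty -d '{
  "queryType": "timeseries",
  "dataSource": "pageviewsLat",
  "granularity": "day",
  "filter": { "type": "selector", "dimension": "user", "value": "alice" },
  "aggregations": [ { "type": "doubleSum", "name": "totalLatency", "fieldName": "latency" } ],
  "intervals": [ "2017-01-01/2018-01-01" ]
}'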
16. Task Submit Commands
• 1. Clear the HDFS storage location:
• hdfs dfs -rm -r /user/root/segments
• 2. Make sure the data source file exists in the local FS (/root/labtest/druid_hadoop/druid-0.10.0/quickstart/Test/pageviewsLatforCountExmaple.json) and upload it to HDFS:
• hdfs dfs -put -f pageviewsLat.json /user/root/quickstart/Test
• 3. Create the index task on Druid (a sketch of such a task file follows below):
• curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/Test/pageviewsLat-index-forCountExample.json localhost:8090/druid/indexer/v1/task
• Task information can be seen at <overlord_host>:8090/console.html
• 4. Verify that the segments were created under /user/root/segments:
• hdfs dfs -ls /user/root/segments
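For reference, a sketch of what an index task file like the one POSTed above might contain; every name, path, and field below is an assumption for illustration, not the deck's actual spec:

# Sketch: skeleton of a Hadoop batch index task (all names/paths assumed).
cat > quickstart/Test/pageviewsLat-index-forCountExample.json <<'EOF'
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviewsLat",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["user", "url"] }
        }
      },
      "metricsSpec": [
        { "type": "count",     "name": "views" },
        { "type": "doubleSum", "name": "latency", "fieldName": "latency" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": [ "2015-09-01/2015-09-02" ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "/user/root/quickstart/Test/pageviewsLat.json" }
    }
  }
}
EOF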
17. Query Commands
• TopN
• Returns the top N pages by latency, in descending order:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-top-latency-pages.json http://localhost:8082/druid/v2/?pretty
• Timeseries
• Returns total latency, filtered by user = "alice", with "granularity": "day" [or "all"]:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeseries-pages.json http://localhost:8082/druid/v2/?pretty
• groupBy
• A) Returns aggregated latency grouped by user + url (a sketch of this query file follows below):
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-aggregateLatencyGrpByURLUser.json http://localhost:8082/druid/v2/?pretty
• B) Returns the aggregated page count (i.e. the number of URLs accessed) grouped by user:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-countURLAccessedGrpByUser.json http://localhost:8082/druid/v2/?pretty
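A sketch of what query file A) above might contain, inlined here for readability (field names are assumptions):

# Sketch: total latency grouped by user + url.
curl -L -H 'Content-Type: application/json' -XPOST http://localhost:8082/druid/v2/?pretty --data-binary '{
  "queryType": "groupBy",
  "dataSource": "pageviewsLat",
  "granularity": "all",
  "dimensions": [ "user", "url" ],
  "aggregations": [ { "type": "doubleSum", "name": "totalLatency", "fieldName": "latency" } ],
  "intervals": [ "2015-09-01/2015-09-02" ]
}'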
18. Query Commands
• Time Boundary
• Time boundary queries return the earliest and latest data points of a data set:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeBoundary-pages.json http://localhost:8082/druid/v2/?pretty
• Search
• A search query returns dimension values that match the search specification, e.g. here searching the url dimension for matches with the text "facebook" (a sketch of this query follows below):
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-search-URL-pages.json http://localhost:8082/druid/v2/?pretty
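A sketch of the search query just described (data source, dimension, and interval are assumptions): it returns the values of the url dimension that contain the text "facebook".

# Sketch: search the url dimension for values containing "facebook".
curl -L -H 'Content-Type: application/json' -XPOST http://localhost:8082/druid/v2/?pretty --data-binary '{
  "queryType": "search",
  "dataSource": "pageviewsLat",
  "granularity": "all",
  "searchDimensions": [ "url" ],
  "query": { "type": "insensitive_contains", "value": "facebook" },
  "intervals": [ "2015-09-01/2015-09-02" ]
}'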