Interactive Analytics at Scale
By: Suman Banerjee
Druid Concepts
• What is it?
Druid is an open-source, fast, distributed, column-oriented data store,
designed for low-latency ingestion and very fast ad-hoc, aggregation-based
analytics.
• Pros:
• Fast responses for aggregation queries (often sub-second).
• Real-time streaming ingestion, integrating with popular systems such as
Kafka, Samza, and Spark.
• Traditional batch ingestion (Hadoop-based).
• Cons / limitations:
• Joins are not yet mature.
• Fewer query options than other SQL-like solutions.
Brief History of Druid
• History
• Druid was started in 2011 to power the analytics product at Metamarkets. The project
was open-sourced under the GPL in October 2012 and moved to an Apache License in February 2015.
Companies running Druid in production
• Metamarkets
• Druid is the primary data store for Metamarkets' full-stack visual analytics service for the RTB (real-time bidding) space. Ingesting over
30 billion events per day, Metamarkets provides insight to its customers through complex ad-hoc queries, with query times of
around 1 second almost 95% of the time.
• Airbnb
• Druid powers slice-and-dice analytics on both historical and real-time metrics. It significantly reduces the latency of analytic queries
and helps people get insights more interactively.
• Alibaba
• Alibaba Search Group uses Druid for real-time analytics of users' interactions with its popular e-commerce site.
• Cisco
• Cisco uses Druid to power a real-time analytics platform for network flow data.
• eBay
• eBay uses Druid to aggregate multiple data streams for real-time user behavior analytics, ingesting at a very high rate (over
100,000 events/sec), with the ability to query or aggregate data by any arbitrary combination of dimensions, and supporting over 100
concurrent queries without impacting ingest rate or query latency.
Industries …
Druid in Production - Metamarkets
• 3M+ events/sec through Druid's real-time ingestion.
• 100+ PB of data.
• Applications supporting 1000s of queries per second concurrently.
• 1000s of cores, scaled out horizontally.
• …
• References:
• https://metamarkets.com/2016/impact-on-query-speed-from-forced-processing-ordering-in-druid/
• https://metamarkets.com/2016/distributing-data-in-druid-at-petabyte-scale/
A real example of Druid in Action
Reference :- https://whynosql.com/2015/11/06/lambda-architecture-with-druid-at-gumgum/
When is Druid a good fit?
• You need:
• Fast aggregation and arbitrary data exploration with low latency on huge data sets.
• Fast responses on near-real-time event data (ingested data is immediately
available for querying).
• No single point of failure (SPoF).
• To handle petabytes of data with multiple dimensions.
• Sub-second, time-oriented summarization of the incoming data
stream.
• NOTE: Before we look at the architecture, here is a typical use case
to illustrate what has been said so far.
Druid Concepts – An example
• The Data
timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53
GROUP BY timestamp, publisher, advertiser, gender, country
:: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)
timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
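The GROUP BY above can be sketched in a few lines of plain Python. This is only an illustration of the roll-up idea, not Druid's implementation. It assumes timestamps are truncated to the hour (suggested by the on-the-hour timestamps in the result table) and treats them as plain strings. The figures in the slide's result table evidently come from a larger data set, so the totals computed from the six sample rows will differ from it.

```python
from collections import defaultdict

# Raw events copied from the slide:
# (timestamp, publisher, advertiser, gender, country, click, price)
EVENTS = [
    ("2011-01-01T01:01:35Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.65),
    ("2011-01-01T01:03:63Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.62),
    ("2011-01-01T01:04:51Z", "bieberfever.com", "google.com", "Male", "USA", 1, 0.45),
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.87),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.99),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 1, 1.53),
]


def truncate_to_hour(ts):
    """Keep 'YYYY-MM-DDTHH' and zero out minutes and seconds (textual truncation)."""
    return ts[:13] + ":00:00Z"


def rollup(events):
    """Group by (hour, publisher, advertiser, gender, country) and aggregate."""
    totals = defaultdict(lambda: [0, 0, 0.0])  # impressions, clicks, revenue
    for ts, publisher, advertiser, gender, country, click, price in events:
        key = (truncate_to_hour(ts), publisher, advertiser, gender, country)
        agg = totals[key]
        agg[0] += 1        # impressions = COUNT(1)
        agg[1] += click    # clicks = SUM(click)
        agg[2] += price    # revenue = SUM(price)
    return {key: (i, c, round(r, 2)) for key, (i, c, r) in totals.items()}


if __name__ == "__main__":
    for key, (impressions, clicks, revenue) in sorted(rollup(EVENTS).items()):
        print(*key, impressions, clicks, revenue)
```

Six raw rows collapse into three summary rows here. The storage saving grows with the number of events that share the same hour and dimension values.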
Druid – Architecture
[Architecture diagram. Labels: Client; Broker Node; Real Time Node (indexing streaming data); Historical Node; Overlord node (indexing static data from HDFS); Deep Storage (HDFS); flows marked DATA and QUERY.]
Druid – Architecture (cluster management dependency)
[Architecture diagram. Labels: Client; Broker Node; Real Time Node (indexing streaming data); Historical Node; Deep Storage (HDFS); Coordinator node; Zookeeper; Meta Store; flows marked DATA and QUERY.]
Druid – Components
• Broker Node
• Real time node
• Overlord Node
• Middle-Manager Node
• Historical Node
• Coordinator Node
• Aside from these nodes, there are 3 external dependencies to the system:
• A running ZooKeeper cluster for cluster service discovery and maintenance of current
data topology
• A metadata storage instance for maintenance of metadata about the data segments that
should be served by the system
• A "deep storage" system to hold the stored segments.
Druid - Data Storage Layer
• Segments and Data Storage
• Druid stores its index in segment files, which are partitioned by time.
• Segment files are columnar: the data for each column is laid out in separate data structures.
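To make "columnar" concrete, here is a minimal Python sketch that turns row-oriented records into one list per column. It only illustrates the layout idea. Druid's real segment files are considerably more elaborate, and the sample rows and column names below are taken from the earlier example purely for illustration.

```python
# Row-oriented view: one record per event.
ROWS = [
    {"publisher": "bieberfever.com", "country": "USA", "click": 0, "price": 0.65},
    {"publisher": "bieberfever.com", "country": "USA", "click": 0, "price": 0.62},
    {"publisher": "ultratrimfast.com", "country": "UK", "click": 1, "price": 1.53},
]


def to_columns(rows):
    """Lay the data out column by column: {column name: [values in row order]}."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)

# An aggregation over one metric only has to read that one column.
total_clicks = sum(COLUMNS["click"])

if __name__ == "__main__":
    print(COLUMNS["publisher"])
    print(total_clicks)
```

The design point is that a query such as SUM(click) reads one list and never touches the publisher or country values, which is why column stores suit aggregation-heavy workloads.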
Druid – Query
• Query types:
• Timeseries
• TopN
• GroupBy & Aggregations
• Time Boundary
• Search
• Select
• Main parts of a query:
• a) queryType
• b) granularity
• c) filter
• d) aggregation
• e) post-aggregation
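The query parts listed above map onto fields of a Druid native JSON query. The following Python sketch builds a timeseries query body showing each part. The data source name ("pageviews"), the column names ("user", "latencyMs"), and the interval are assumptions made for illustration and are not taken from the demo files.

```python
import json


def build_timeseries_query(data_source, interval, user):
    """Return a timeseries query body as a dict (names are hypothetical)."""
    return {
        "queryType": "timeseries",            # a) queryType
        "dataSource": data_source,
        "granularity": "day",                 # b) granularity
        "intervals": [interval],
        "filter": {                           # c) filter: keep rows where user == <user>
            "type": "selector",
            "dimension": "user",
            "value": user,
        },
        "aggregations": [                     # d) aggregations
            {"type": "count", "name": "rows"},
            {"type": "doubleSum", "name": "total_latency", "fieldName": "latencyMs"},
        ],
        "postAggregations": [                 # e) post-aggregation: total_latency / rows
            {
                "type": "arithmetic",
                "name": "avg_latency",
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "name": "total_latency", "fieldName": "total_latency"},
                    {"type": "fieldAccess", "name": "rows", "fieldName": "rows"},
                ],
            }
        ],
    }


if __name__ == "__main__":
    query = build_timeseries_query("pageviews", "2015-01-01/2016-01-01", "alice")
    print(json.dumps(query, indent=2))
```

The printed JSON could be saved to a file and posted to the broker with curl, in the same way as the query files used in the demo.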
Demo
Task Submit Commands
• 1. Clear the HDFS storage location:
• hdfs dfs -rm -r /user/root/segments
• 2. Make sure the data source exists in the local FS
(/root/labtest/druid_hadoop/druid-0.10.0/quickstart/Test/pageviewsLatforCountExmaple.json) and upload it to HDFS:
• hdfs dfs -put -f pageviewsLat.json /user/root/quickstart/Test
• 3. Create the index task on Druid:
• curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/Test/pageviewsLat-index-forCountExample.json localhost:8090/druid/indexer/v1/task
• Task information can be seen at <overlord_Host:8090>/console.html
• 4. Verify that the segments were created under /user/root/segments:
• hdfs dfs -ls /user/root/segments
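For scripting the task submission instead of using curl, a rough Python equivalent using only the standard library might look like the sketch below. The endpoint is the one used in the curl command above. The function names are hypothetical, and the sketch assumes the overlord is reachable on localhost:8090.

```python
import json
import urllib.request

# Same overlord endpoint as in the curl command above.
OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"


def build_task_request(spec_path, url=OVERLORD_TASK_URL):
    """Read an ingestion spec file and wrap it in a POST request (nothing is sent)."""
    with open(spec_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early if the spec is not valid JSON
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec_path):
    """Send the request to the overlord and return its parsed JSON response."""
    with urllib.request.urlopen(build_task_request(spec_path)) as resp:
        return json.loads(resp.read())
```

Calling submit_task("quickstart/Test/pageviewsLat-index-forCountExample.json") from the Druid directory would correspond to the curl command in step 3. It requires a running overlord.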
Query commands
• TopN
• Returns the top N pages by latency, in descending order.
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-top-latency-pages.json http://localhost:8082/druid/v2/?pretty
• Timeseries
• Returns the total latency, filtered by user="alice", with "granularity": "day" (or "all").
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeseries-pages.json http://localhost:8082/druid/v2/?pretty
• groupBy
• A) Returns the aggregated latency, grouped by user and url.
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-aggregateLatencyGrpByURLUser.json http://localhost:8082/druid/v2/?pretty
• B) Returns the aggregated page count (i.e. the number of URLs accessed), grouped by user.
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-countURLAccessedGrpByUser.json http://localhost:8082/druid/v2/?pretty
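The query files referenced above are not reproduced in the slides. As a rough idea of what such files might contain, the Python sketch below builds a topN and a groupBy query body in Druid's native JSON format. The data source name, the column names ("user", "url", "latencyMs"), and the interval are assumptions, not the contents of the actual demo files.

```python
import json


def latency_sum():
    """Aggregator summing a hypothetical 'latencyMs' metric column."""
    return {"type": "doubleSum", "name": "total_latency", "fieldName": "latencyMs"}


def build_topn_query(data_source, interval, threshold=5):
    """Top N urls ranked by total latency, highest first."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "dimension": "url",           # the single dimension being ranked
        "metric": "total_latency",    # rank by this aggregated value
        "threshold": threshold,       # the N in topN
        "aggregations": [latency_sum()],
    }


def build_groupby_query(data_source, interval):
    """Total latency grouped by user and url."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "dimensions": ["user", "url"],
        "aggregations": [latency_sum()],
    }


if __name__ == "__main__":
    print(json.dumps(build_topn_query("pageviews", "2015-01-01/2016-01-01"), indent=2))
    print(json.dumps(build_groupby_query("pageviews", "2015-01-01/2016-01-01"), indent=2))
```

A topN query ranks a single dimension and is usually the cheaper choice when only the top few values are needed. A groupBy query handles grouping over several dimensions at once.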
Query commands
• Time Boundary
• Time boundary queries return the earliest and latest data points of a data set.
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeBoundary-pages.json http://localhost:8082/druid/v2/?pretty
• Search
• A search query returns the dimension values that match the search specification; here, values of the dimension
url that contain the text "facebook".
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-search-URL-pages.json http://localhost:8082/druid/v2/?pretty
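As with the earlier queries, the JSON files used here are not shown in the slides. The Python sketch below shows what a time boundary and a search query body might look like, plus a helper that wraps any query in a POST request for the broker endpoint used in the curl commands. The data source name and interval are assumptions, and the function names are hypothetical.

```python
import json
import urllib.request

# Broker endpoint from the curl commands above.
BROKER_URL = "http://localhost:8082/druid/v2/?pretty"


def build_time_boundary_query(data_source):
    """Ask for the earliest and latest timestamps in the data source."""
    return {"queryType": "timeBoundary", "dataSource": data_source}


def build_search_query(data_source, interval, dimension, text):
    """Find values of `dimension` that contain `text` (case-insensitive)."""
    return {
        "queryType": "search",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "searchDimensions": [dimension],
        "query": {"type": "insensitive_contains", "value": text},
    }


def build_broker_request(query, url=BROKER_URL):
    """Wrap a query dict in a POST request (nothing is sent yet)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query):
    """Send the query to the broker and return the parsed JSON result."""
    with urllib.request.urlopen(build_broker_request(query)) as resp:
        return json.loads(resp.read())
```

With a running broker, run_query(build_search_query("pageviews", "2015-01-01/2016-01-01", "url", "facebook")) would play the same role as the search curl command above.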
Thank You

Understanding apache-druid

  • 1. Interactive Analytics at Scale By:-Suman Banerjee
  • 2. Druid Concepts • What is it ? Druid is a open source fast distributed column-oriented data store. Designed for low latency ingestion and very fast ad-hoc aggregation based analytics. • Pros:- • Fast response in aggregation operation (in almost sub-second) • Supports real time streaming ingestion capability with many other popular solution in market e.g. Kafka , Samza , Spark etc. • Traditional Batch type ingestion ( Hadoop based ). • Cons / Limitation :- • joins are not mature enough • Limited options compared to other SQL like solutions.
  • 3. Brief History of Druid
      • Druid was started in 2011 to power the analytics product at Metamarkets. The project was open-sourced under the GPL in October 2012 and moved to an Apache License in February 2015.
  • 4. Companies running Druid in production
      • Metamarkets: Druid is the primary data store for Metamarkets' full-stack visual analytics service for the RTB (real-time bidding) space. Ingesting over 30 billion events per day, Metamarkets provides insight to its customers through complex ad-hoc queries, with query times of around 1 second almost 95% of the time.
      • Airbnb: Druid powers slice-and-dice analytics on both historical and real-time metrics. It significantly reduces the latency of analytic queries and helps people get insights more interactively.
      • Alibaba: Alibaba Search Group uses Druid for real-time analytics of users' interactions with its popular e-commerce site.
      • Cisco: Cisco uses Druid to power a real-time analytics platform for network flow data.
      • eBay: eBay uses Druid to aggregate multiple data streams for real-time user behavior analytics, ingesting at a very high rate (over 100,000 events/sec), with the ability to query or aggregate data by any combination of dimensions, and supporting over 100 concurrent queries without impacting ingest rate or query latency.
  • 6. Druid in Production - Metamarkets
      • 3M+ events/sec through Druid's real-time ingestion.
      • 100+ PB of data.
      • Applications supporting thousands of concurrent queries per second.
      • Scales out horizontally across thousands of cores.
      • …
      • References:
          • https://metamarkets.com/2016/impact-on-query-speed-from-forced-processing-ordering-in-druid/
          • https://metamarkets.com/2016/distributing-data-in-druid-at-petabyte-scale/
  • 7. A real example of Druid in action
      • Reference: https://whynosql.com/2015/11/06/lambda-architecture-with-druid-at-gumgum/
  • 8. Ideal requirements for Druid
      • You need:
          • Fast aggregation and arbitrary, low-latency data exploration on huge data sets.
          • Fast responses on near-real-time event data (ingested data is immediately available for querying).
          • No single point of failure (SPoF).
          • The ability to handle petabytes of data with multiple dimensions.
          • Time-oriented summarization of the incoming data stream in less than a second.
      • NOTE: before moving on to the architecture, the next slide shows a typical use case to illustrate what has been said so far.
  • 9. Druid Concepts – An example
      • The data:

        timestamp             publisher          advertiser  gender  country  click  price
        2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
        2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
        2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
        2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
        2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
        2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53

      • Aggregated with: GROUP BY timestamp, publisher, advertiser, gender, country :: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)

        timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
        2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
        2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
        2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
        2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
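The GROUP BY on this slide can be tried out on the six sample rows with a few lines of Python. This is a minimal sketch of the idea only, not Druid's implementation: the `rollup` helper is a hypothetical name, and it assumes hourly granularity, obtained by truncating each timestamp string to its hour.

```python
from collections import defaultdict

# The six raw events from the slide:
# (timestamp, publisher, advertiser, gender, country, click, price)
EVENTS = [
    ("2011-01-01T01:01:35Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.65),
    ("2011-01-01T01:03:63Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.62),
    ("2011-01-01T01:04:51Z", "bieberfever.com", "google.com", "Male", "USA", 1, 0.45),
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.87),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.99),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 1, 1.53),
]


def rollup(events):
    """Group events by hour plus all dimensions, then aggregate the metrics."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "revenue": 0.0})
    for ts, publisher, advertiser, gender, country, click, price in events:
        hour = ts[:13] + ":00:00Z"  # assumed hourly granularity
        row = totals[(hour, publisher, advertiser, gender, country)]
        row["impressions"] += 1     # COUNT(1)
        row["clicks"] += click      # SUM(click)
        row["revenue"] += price     # SUM(price)
    return dict(totals)


if __name__ == "__main__":
    for key, metrics in sorted(rollup(EVENTS).items()):
        print(key, metrics["impressions"], metrics["clicks"], round(metrics["revenue"], 2))
```

On this sample the six raw rows collapse into three aggregated rows, one per distinct combination of hour and dimension values.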
  • 10. Druid – Architecture [diagram; labels: Client, Streaming data, Static data, Indexing, Real Time Node, Broker Node, Historical Node, Overlord node, Deep Storage (HDFS), DATA and QUERY flows]
  • 11. Druid – Architecture (cluster management dependency) [diagram; labels: Client, Streaming data, Indexing, Real Time Node, Broker Node, Historical Node, Coordinator node, Zookeeper, Meta Store, Deep Storage (HDFS), DATA and QUERY flows]
  • 12. Druid – Components
      • Broker Node
      • Real-time Node
      • Overlord Node
      • Middle-Manager Node
      • Historical Node
      • Coordinator Node
      • Aside from these nodes, the system has 3 external dependencies:
          • A running ZooKeeper cluster for cluster service discovery and maintenance of the current data topology.
          • A metadata storage instance for maintaining metadata about the data segments that should be served by the system.
          • A "deep storage" system to hold the stored segments.
  • 13. Druid - Data Storage Layer
      • Segments and data storage
          • Druid stores its index in segment files, which are partitioned by time.
          • Columnar: the data for each column is laid out in separate data structures.
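To make the two ideas on this slide concrete, here is a toy Python sketch: rows are first partitioned into time chunks, and each chunk then keeps one list per column. It only illustrates the layout; real Druid segment files are considerably more elaborate, and the `build_segments` helper and the hourly chunking are assumptions made for the example.

```python
from collections import defaultdict


def build_segments(rows, columns):
    """Partition rows by hour, storing each column of a chunk as its own list."""
    segments = defaultdict(lambda: {name: [] for name in columns})
    for row in rows:
        chunk = row["timestamp"][:13]  # e.g. "2011-01-01T01" (assumed hourly chunk)
        for name in columns:
            segments[chunk][name].append(row[name])
    return dict(segments)


# A few rows borrowed from the example data earlier in the deck.
ROWS = [
    {"timestamp": "2011-01-01T01:01:35Z", "publisher": "bieberfever.com", "price": 0.65},
    {"timestamp": "2011-01-01T01:04:51Z", "publisher": "bieberfever.com", "price": 0.45},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "ultratrimfast.com", "price": 0.99},
]

if __name__ == "__main__":
    segments = build_segments(ROWS, ["timestamp", "publisher", "price"])
    # An aggregation over one column only has to read that column's list.
    print(sum(segments["2011-01-01T01"]["price"]))
```

The point of the layout is that a query restricted to a time range touches only the matching chunks, and an aggregation over `price` never reads the `publisher` values.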
  • 14. Druid – Query
      • Query types:
          • Timeseries
          • TopN
          • GroupBy & Aggregations
          • Time Boundary
          • Search
          • Select
      • Parts of a query:
          • a) queryType
          • b) granularity
          • c) filter
          • d) aggregation
          • e) post-aggregation
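The five parts listed on this slide (queryType, granularity, filter, aggregation, post-aggregation) correspond to top-level keys of a native Druid JSON query. The sketch below assembles a timeseries query as a Python dict. The data source name `pageviews`, the `user` and `latency` columns, the interval and the output names are assumptions chosen to resemble the demo, not values taken from the slides.

```python
import json


def timeseries_query(data_source, interval, user):
    """Build a native timeseries query: total and average latency for one user."""
    return {
        "queryType": "timeseries",      # a) which kind of query to run
        "dataSource": data_source,
        "granularity": "day",           # b) size of the time buckets
        "intervals": [interval],
        "filter": {                     # c) keep only rows for this user
            "type": "selector",
            "dimension": "user",
            "value": user,
        },
        "aggregations": [               # d) computed per time bucket
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "total_latency", "fieldName": "latency"},
        ],
        "postAggregations": [           # e) arithmetic on the aggregates
            {
                "type": "arithmetic",
                "name": "avg_latency",
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "total_latency"},
                    {"type": "fieldAccess", "fieldName": "rows"},
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical data source and interval.
    print(json.dumps(timeseries_query("pageviews", "2015-09-01/2015-09-02", "alice"), indent=2))
```

Keeping the query as a plain dict makes it easy to change one part, such as the granularity or the filter value, before serializing it to JSON.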
  • 15. Demo
  • 16. Task Submit Commands
      • Clear the HDFS storage location:
          hdfs dfs -rm -r /user/root/segments
      • Make sure the data source file exists on the local FS (/root/labtest/druid_hadoop/druid-0.10.0/quickstart/Test/pageviewsLatforCountExmaple.json) and upload it to HDFS:
          hdfs dfs -put -f pageviewsLat.json /user/root/quickstart/Test
      • Create the index task on Druid:
          curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/Test/pageviewsLat-index-forCountExample.json localhost:8090/druid/indexer/v1/task
      • Task information can be seen at <overlord_Host:8090>/console.html
      • Verify that the segments were created under /user/root/segments:
          hdfs dfs -ls /user/root/segments
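The curl call that creates the index task can also be issued from a script. Below is a minimal Python sketch using only the standard library. It targets the same overlord endpoint as the slide; the helper names and the placeholder task spec are assumptions for illustration, and actually sending the request requires a running overlord.

```python
import json
import urllib.request

# Overlord endpoint used by the curl command on the slide.
OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"


def build_task_request(task_spec, url=OVERLORD_TASK_URL):
    """Build (but do not send) the POST request that submits an indexing task."""
    body = json.dumps(task_spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(task_spec):
    """Send the task to the overlord and return its decoded JSON reply."""
    with urllib.request.urlopen(build_task_request(task_spec)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder spec: in practice, load the JSON file that curl sends with -d @...
    request = build_task_request({"type": "index_hadoop", "spec": {}})
    print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the first half checkable without a cluster.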
  • 17. Query commands
      • TopN: returns the top N pages by latency, in descending order.
          curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-top-latency-pages.json http://localhost:8082/druid/v2/?pretty
      • Timeseries: returns the total latency, filtered by user="alice", with "granularity": "day" (or "all").
          curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeseries-pages.json http://localhost:8082/druid/v2/?pretty
      • groupBy:
          • A) Returns the aggregated latency grouped by user and url.
              curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-aggregateLatencyGrpByURLUser.json http://localhost:8082/druid/v2/?pretty
          • B) Returns the aggregated page count (i.e. the number of urls accessed) grouped by user.
              curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-countURLAccessedGrpByUser.json http://localhost:8082/druid/v2/?pretty
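The JSON files passed to curl are not shown on the slides. As a rough guide to what a topN and a groupBy query of this kind might contain, here is a Python sketch. The data source `pageviews`, the `url`, `user` and `latency` columns, the interval and both helper names are assumptions, not the contents of the demo files.

```python
import json

INTERVAL = "2015-09-01/2015-09-02"  # hypothetical interval
DATA_SOURCE = "pageviews"           # hypothetical data source name

# Shared aggregation: sum the (assumed) latency column.
SUM_LATENCY = {"type": "longSum", "name": "total_latency", "fieldName": "latency"}


def top_latency_pages(n):
    """topN: the n urls with the highest summed latency, in descending order."""
    return {
        "queryType": "topN",
        "dataSource": DATA_SOURCE,
        "granularity": "all",
        "intervals": [INTERVAL],
        "dimension": "url",
        "metric": "total_latency",  # rank urls by this aggregate
        "threshold": n,
        "aggregations": [SUM_LATENCY],
    }


def latency_by_user_and_url():
    """groupBy: summed latency for every (user, url) pair."""
    return {
        "queryType": "groupBy",
        "dataSource": DATA_SOURCE,
        "granularity": "all",
        "intervals": [INTERVAL],
        "dimensions": ["user", "url"],
        "aggregations": [SUM_LATENCY],
    }


if __name__ == "__main__":
    print(json.dumps(top_latency_pages(5), indent=2))
    print(json.dumps(latency_by_user_and_url(), indent=2))
```

Note the difference in shape: topN ranks a single `dimension` by one `metric` and cuts off at `threshold`, while groupBy accepts a list of `dimensions` and returns every group.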
  • 18. Query commands
      • Time Boundary: time boundary queries return the earliest and latest data points of a data set.
          curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeBoundary-pages.json http://localhost:8082/druid/v2/?pretty
      • Search: a search query returns the dimension values that match the search specification; here, for example, it searches for values of the dimension url that contain the text "facebook".
          curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-search-URL-pages.json http://localhost:8082/druid/v2/?pretty
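For reference, the two query types on this slide have a very small native JSON form. The Python sketch below shows one plausible shape for each. The data source name, the interval and the helper names are assumptions, and the actual demo files may differ.

```python
import json


def time_boundary_query(data_source):
    """timeBoundary: ask for the earliest and latest timestamps in a data source."""
    return {"queryType": "timeBoundary", "dataSource": data_source}


def search_query(data_source, dimension, text, interval):
    """search: find values of one dimension that contain the given text."""
    return {
        "queryType": "search",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "searchDimensions": [dimension],
        # Case-insensitive substring match on the dimension values.
        "query": {"type": "insensitive_contains", "value": text},
    }


if __name__ == "__main__":
    # Hypothetical data source and interval.
    print(json.dumps(time_boundary_query("pageviews")))
    print(json.dumps(search_query("pageviews", "url", "facebook", "2015-09-01/2015-09-02")))
```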