Real-Time Analytics With Druid
Aaron Brooks
Solutions Engineer
abrooks@hortonworks.com
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
• Druid Overview
• Architecture
• Data model & queries
• Druid with Hive
• Demo
Druid Overview
Druid capabilities
• Streaming ingestion capability
• Data freshness – analyze events as they occur
• Fast response time (ideally under 1 second per query)
• Arbitrary slicing and dicing
• Multi-tenancy – 1000s of concurrent users
• Scalability and availability
• Rich real-time visualization with Superset
Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
Companies Using Druid
History
• Development started at Metamarkets in 2011
• Initial use case
– Power an ad-tech analytics product
• Open sourced in late 2012
– Initially GPL licensed
– Switched to the Apache License 2.0 in early 2015
• 150+ committers today
• In production at many companies
Druid Is Red Hot Technology
[Chart: Popularity of Major Data Management Technologies by GitHub Followers (Source: GitHub). X-axis: 9/1/12 to 9/1/17; y-axis: 0 to 10,000; series: Cassandra, Hadoop, Kafka, Spark, Storm, Druid.]
In data / analytics, only Spark and Kafka have more traction than Druid.
Druid is the foundation of the modern streaming architecture.
Cool stuff you can do with Druid
• Spatial indexing
– Query for points inside a rectangle or within a given radius of a point.
• Use cases:
– In-store push offers.
– Count users / devices within a radius.
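A minimal sketch of the two spatial query shapes above, built as plain Python dictionaries in the form of Druid's native JSON spatial filter. The dimension name `coordinates` and the sample location are hypothetical. The field names (`bound`, `minCoords`, `maxCoords`, `coords`, `radius`) follow the spatial filter as described in the Druid documentation and should be checked against the version in use.

```python
import json


def rectangular_filter(dimension, min_coords, max_coords):
    """Spatial filter matching points inside an axis-aligned rectangle."""
    return {
        "type": "spatial",
        "dimension": dimension,
        "bound": {
            "type": "rectangular",
            "minCoords": list(min_coords),
            "maxCoords": list(max_coords),
        },
    }


def radius_filter(dimension, center, radius):
    """Spatial filter matching points within `radius` of `center`."""
    return {
        "type": "spatial",
        "dimension": dimension,
        "bound": {"type": "radius", "coords": list(center), "radius": radius},
    }


# Hypothetical example: devices near a store, for an in-store push offer.
near_store = radius_filter("coordinates", (37.77, -122.42), 0.01)
print(json.dumps(near_store, indent=2))
```

Such a filter would go in the `filter` field of a query; combined with a count aggregation it covers the "count users / devices within a radius" use case.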
Druid: Fast Facts
• Most events per day: 30 billion events / day (Metamarkets)
• Most computed metrics: 2 billion metrics / min (Jolata)
• Largest cluster: 200 nodes (Metamarkets)
• Largest hourly ingestion: 3 TB per hour (Netflix)
Druid Architecture
Druid: Services
• Broker
• Coordinator
• Historical
• Realtime
• Router
• Overlord, Middle Managers and Peons
• Different node types solve different problems
• Dedicated processes for
– Historical data
– Ingestion
– Coordination
– Result merging
A Typical Druid Deployment
[Diagram labels: many nodes, based on data size and number of queries; HDFS or S3; Superset or dashboards, BI tools]
Druid: Segments
• Data in Druid is stored in segment files.
• Segments are partitioned by time, which supports fast time-based slice-and-dice.
• Ideally, each segment file is smaller than 1 GB.
• If segment files grow larger than that, use smaller time partitions.
[Diagram: time axis with one segment per day. Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5: Friday]
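A minimal sketch of time partitioning, assuming one segment per calendar day as in the Monday to Friday picture. The event fields and function names are hypothetical; the point is only that a time-bounded query can skip every segment outside its interval.

```python
from collections import defaultdict
from datetime import date, datetime


def build_segments(events):
    """Group events into one segment per calendar day (the time chunk)."""
    segments = defaultdict(list)
    for event in events:
        segments[event["timestamp"].date()].append(event)
    return dict(segments)


def segments_for_range(segments, start, end):
    """Return only the segment keys whose day falls in [start, end]."""
    return sorted(day for day in segments if start <= day <= end)


events = [
    {"timestamp": datetime(2017, 9, 4, 9, 30), "page": "home"},     # Monday
    {"timestamp": datetime(2017, 9, 4, 17, 5), "page": "cart"},     # Monday
    {"timestamp": datetime(2017, 9, 5, 8, 0), "page": "home"},      # Tuesday
    {"timestamp": datetime(2017, 9, 6, 12, 45), "page": "search"},  # Wednesday
]

segments = build_segments(events)
# A query for Tuesday through Wednesday never touches Monday's segment.
needed = segments_for_range(segments, date(2017, 9, 5), date(2017, 9, 6))
print(needed)
```

Switching the grouping key from day to hour is the "smaller time partitions" remedy for oversized segments.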
Historical Nodes
• Main workhorses of a Druid cluster
• Load immutable, read-optimized segments
• Respond to queries
• Use memory-mapped files to load segments
Broker Nodes
• Keep track of segment announcements in the cluster
– (This information is kept in ZooKeeper, much as Storm and HBase do.)
• Scatter each query across historical and realtime nodes
– (Clients issue queries to this node, but queries are processed elsewhere.)
• Merge the partial results returned by the nodes that served the query
• (Distributed) caching layer
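The scatter and merge steps can be sketched as follows. This is a toy model under stated assumptions: node names, rows, and function names are hypothetical, and a real Broker sends the query over the network rather than looping over in-memory lists.

```python
from collections import Counter


def query_node(segment_rows, dimension, metric):
    """What one data node does: aggregate only the rows it holds."""
    partial = Counter()
    for row in segment_rows:
        partial[row[dimension]] += row[metric]
    return partial


def broker_query(nodes, dimension, metric):
    """Scatter the query to every node, then merge the partial results."""
    merged = Counter()
    for rows in nodes.values():
        merged.update(query_node(rows, dimension, metric))
    return dict(merged)


nodes = {
    "historical-1": [{"page": "home", "edits": 3}, {"page": "cart", "edits": 1}],
    "historical-2": [{"page": "home", "edits": 2}],
    "realtime-1": [{"page": "cart", "edits": 4}, {"page": "search", "edits": 5}],
}

totals = broker_query(nodes, "page", "edits")
print(totals)
```

Because each node returns an already-aggregated partial result, the merge step only has to add small per-node summaries, not raw rows.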
Coordinator Nodes
• Assign segments to historical nodes
• Use an interval-based cost function to distribute segments
• Make sure query load is uniform across historical nodes
• Handle replication of data
• Apply configurable rules to load/drop data
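A simplified sketch of rule-based load/drop decisions, assuming rules are checked in order and the first matching rule decides. The rule names echo Druid's `loadByPeriod` and `dropForever` rule types, but the `days` and `replicas` fields are simplifications made for this example and are not Druid's actual rule schema.

```python
from datetime import date, timedelta

# Simplified rule chain: keep the last 30 days with 2 replicas, drop the rest.
# Field names (days, replicas) are assumptions for illustration only.
RULES = [
    {"type": "loadByPeriod", "days": 30, "replicas": 2},
    {"type": "dropForever"},
]


def apply_rules(segment_day, today, rules=RULES):
    """Return the action of the first rule that matches the segment."""
    for rule in rules:
        if rule["type"] == "loadByPeriod":
            if today - segment_day <= timedelta(days=rule["days"]):
                return ("load", rule["replicas"])
        elif rule["type"] == "dropForever":
            return ("drop", 0)
    return ("ignore", 0)


today = date(2017, 9, 30)
print(apply_rules(date(2017, 9, 20), today))
print(apply_rules(date(2017, 6, 1), today))
```

The replica count returned by a load rule is where replication and retention meet: recent data can be kept on more historical nodes than older data.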
Druid: Segment Data Structures
• Within a segment:
– Timestamp column group.
– Dimensions column group.
– Metrics column group.
– Indexes that facilitate fast lookup and aggregation.
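A toy sketch of the segment layout listed above: separate timestamp, dimension, and metric columns, plus a per-value index on each dimension. Python sets stand in for the compressed bitmaps a real segment would use, and all row data and names are hypothetical.

```python
rows = [
    {"timestamp": "2017-09-04T00:00:00Z", "page": "home", "user": "ann", "added": 10},
    {"timestamp": "2017-09-04T01:00:00Z", "page": "cart", "user": "bob", "added": 4},
    {"timestamp": "2017-09-04T02:00:00Z", "page": "home", "user": "bob", "added": 7},
]

# Column-oriented layout: one list per column instead of one dict per row.
timestamps = [r["timestamp"] for r in rows]
metric_added = [r["added"] for r in rows]


def build_dimension(values):
    """Dictionary-encode a dimension and build a value -> row-ids index."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    encoded = [dictionary[v] for v in values]
    index = {}
    for row_id, v in enumerate(values):
        index.setdefault(v, set()).add(row_id)
    return dictionary, encoded, index


_, page_encoded, page_index = build_dimension([r["page"] for r in rows])
_, _, user_index = build_dimension([r["user"] for r in rows])

# Filter page = 'home' AND user = 'bob' by intersecting row-id sets,
# then aggregate the metric column only for the matching rows.
matching = page_index["home"] & user_index["bob"]
total_added = sum(metric_added[i] for i in matching)
print(page_encoded, sorted(matching), total_added)
```

The filter is resolved entirely on the indexes, so the metric column is read only for rows that already match.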
What Makes Druid Fast?
Druid High Level Architecture
[Diagram labels: Streaming Data; ETL (Samza, Kafka, Storm, Spark, etc.); Realtime Nodes; Handoff; Batch Data; Hadoop; Historical Nodes; Broker Node; Queries]
Druid: Batch Indexing
• Indexing is performed by related components:
– Overlord
– Middle Managers
– Peons
• Batch indexing is done on data that already exists in deep storage (e.g. HDFS).
• Middle Managers spawn Peons, which run ingestion tasks
• Middle Managers receive task definitions that specify which tasks to run and their properties
• Each Peon runs one task
Stream Ingestion: Real-time Index Tasks
• Ability to ingest streams of data
• Stores data in a write-optimized structure
• Periodically converts the write-optimized structure to read-optimized segments
• Events are queryable as soon as they are ingested
• Supports both push-based (Tranquility) and pull-based ingestion
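A toy sketch of the write-optimized to read-optimized conversion, assuming a simple row-count threshold triggers the conversion. The class and its parameters are hypothetical and much simpler than a real index task; the sketch only shows that a query reads both the persisted chunks and the live buffer, so an event is visible as soon as it is ingested.

```python
import bisect


class RealtimeIndex:
    """Toy index: a mutable buffer plus immutable, sorted persisted chunks."""

    def __init__(self, max_buffer_rows=3):
        self.max_buffer_rows = max_buffer_rows
        self.buffer = []      # write-optimized: plain appends
        self.persisted = []   # read-optimized: tuples sorted by timestamp

    def ingest(self, timestamp, value):
        self.buffer.append((timestamp, value))
        if len(self.buffer) >= self.max_buffer_rows:
            self.persist()

    def persist(self):
        """Convert the buffer into an immutable sorted chunk."""
        if self.buffer:
            self.persisted.append(tuple(sorted(self.buffer)))
            self.buffer = []

    def query_sum(self, start, end):
        """Sum values with start <= timestamp < end across both structures."""
        total = 0
        for chunk in self.persisted:
            keys = [ts for ts, _ in chunk]
            lo = bisect.bisect_left(keys, start)
            hi = bisect.bisect_left(keys, end)
            total += sum(v for _, v in chunk[lo:hi])
        total += sum(v for ts, v in self.buffer if start <= ts < end)
        return total


index = RealtimeIndex(max_buffer_rows=3)
for ts, value in [(1, 10), (3, 5), (2, 7), (4, 1)]:
    index.ingest(ts, value)

print(len(index.persisted), len(index.buffer), index.query_sum(2, 5))
```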
Druid: Realtime Indexing
[Diagram labels: Kafka; Spark, Flink, Storm, Python; Push; Pull; Tranquility; Indexing Service (Overlord, MiddleManager, Peons); task; segment; Segments; Deep Storage; Coordinator; Broker; ZooKeeper; Historical nodes, each with a segment cache]
Data model and queries
Queries
• Timeseries: time-based queries
• TopN: equivalent to a GROUP BY plus ORDER BY over one dimension
– Approximate if the dimension has more than 1000 values
• GroupBy
• Time boundary: returns the earliest and latest data points of a data set
• Search / Select queries
• Each query can use operators such as
– Granularity (roll-up)
– Filters
– Aggregation / post-aggregation
– Etc.
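A minimal sketch of how the operators above fit together in one query, built as a Python dictionary in the shape of Druid's native JSON groupBy query. The data source `wikiticker` and the columns `page`, `user`, and `c_added` are borrowed from the Hive slides purely for illustration, and the interval is made up; the field names should be checked against the Druid version in use.

```python
import json


def group_by_query(datasource, dimensions, interval, granularity="day"):
    """Build a groupBy query body; names mirror Druid's native JSON query."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,  # roll-up bucket size
        "dimensions": list(dimensions),
        "filter": {"type": "selector", "dimension": "page", "value": "home"},
        "aggregations": [
            {"type": "longSum", "name": "total_added", "fieldName": "c_added"}
        ],
        "intervals": [interval],
    }


query = group_by_query("wikiticker", ["user"], "2015-09-12/2015-09-13")
body = json.dumps(query, indent=2)
print(body)
# The body would typically be sent as an HTTP POST to the Broker's query
# endpoint; that step is omitted here because it needs a running cluster.
```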
Group-by example
Results example
Druid And Hive
Druid Hive integration
Query Druid from Hive with SQL
• Data already exists in Druid
• Druid has its own JSON-based query language
• No native BI tool integration
• Point Hive to the Broker and specify the data source name
• Use Hive as a virtualization layer
• Query Druid data with SQL and plug in any BI tool
Hive queries acceleration
• Data already exists in Hive
• Data is stored in a distributed filesystem such as HDFS or S3, in a format Hive can read (e.g. TSV, CSV, ORC, Parquet)
• Perform pre-processing over various data sources before feeding the data to Druid
• Accelerate queries over Hive data
• Join hot and cold data
Query Druid from Hive with SQL
• Point Hive to the Broker:
– SET hive.druid.broker.address.default=druid.broker.hostname:8082;
• Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_wikiticker
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
– druid_wikiticker: Hive table name
– org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
– wikiticker: Druid data source name
• The Broker node endpoint is specified as a Hive configuration parameter
• The Druid data schema is discovered automatically through a segment metadata query
Query Druid from Hive
[Screenshots: Hive table creation; Hive query plan; query with SQL]
Hive queries acceleration
• Use a CREATE TABLE AS SELECT (CTAS) statement
CREATE TABLE druid_wikiticker
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
• Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
– Timestamp -> __time
– Dimensions -> page, user
– Metrics -> c_added, c_removed
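The inference can be pictured with the sketch below. It is an assumed, simplified mapping (a timestamp column named `__time` becomes the Druid timestamp, string-like columns become dimensions, numeric columns become metrics) and not the storage handler's actual logic. The Hive types given for `src` are hypothetical, since the slide does not list them.

```python
# Assumed, simplified mapping; the real storage handler's rules may differ.
DIMENSION_TYPES = {"string", "varchar", "char"}
METRIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def infer_druid_roles(hive_columns):
    """Classify Hive columns as Druid timestamp, dimensions, or metrics."""
    roles = {"timestamp": None, "dimensions": [], "metrics": []}
    for name, hive_type in hive_columns:
        if name == "__time" and hive_type == "timestamp":
            roles["timestamp"] = name
        elif hive_type in DIMENSION_TYPES:
            roles["dimensions"].append(name)
        elif hive_type in METRIC_TYPES:
            roles["metrics"].append(name)
        else:
            raise ValueError(f"no Druid role assumed for {name}: {hive_type}")
    return roles


# Column types for `src` are hypothetical; the slide does not list them.
src_columns = [
    ("__time", "timestamp"),
    ("page", "string"),
    ("user", "string"),
    ("c_added", "int"),
    ("c_removed", "int"),
]
roles = infer_druid_roles(src_columns)
print(roles)
```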
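In the CTAS statement, druid_wikiticker is the Hive table name, org.apache.hadoop.hive.druid.DruidStorageHandler is the Hive storage handler class name, and wikiticker is the Druid data source name.
Credit: jcamacho@apache.org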
Index Hive in Druid
[Screenshots: SQL Druid query; SQL Hive query]
Example of Druid queries
Cryptocurrency Market Data
Example of Druid queries
[Screenshots: Druid query; results in shell; results in Superset]
Superset
Demo

Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Druid Scaling Realtime Analytics

  • 1. Real-Time Analytics With Druid. Aaron Brooks, Solutions Engineer, abrooks@hortonworks.com
  • 2. Agenda (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
      – Druid Overview
      – Architecture
      – Data model & queries
      – Druid with Hive
      – Demo
  • 4. Druid capabilities
      – Streaming ingestion capability
      – Data freshness: analyze events as they occur
      – Fast response time (ideally < 1 sec query time)
      – Arbitrary slicing and dicing
      – Multi-tenancy: 1000s of concurrent users
      – Scalability and availability
      – Rich real-time visualization with Superset
    Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
  • 5. Companies Using Druid
  • 6. History
      – Development started at Metamarkets in 2011
      – Initial use case: power an ad-tech analytics product
      – Open sourced in late 2012
          – GPL licensed initially
          – Switched to Apache V2 in early 2015
      – 150+ committers today
      – In production at many companies
  • 7. Druid Is Red Hot Technology
    [Chart: Popularity of Major Data Management Technologies by GitHub Followers (Source: GitHub), 9/1/12 to 9/1/17, scale 0 to 10000. Series: Cassandra, Hadoop, Kafka, Spark, Storm, Druid.]
      – In data / analytics, only Spark and Kafka have more traction than Druid.
      – Druid is the foundation of the modern streaming architecture.
  • 8. Cool stuff you can do with Druid
      – Spatial indexing: query within a rectangular bound or within a circular distance (radius) from a point.
      – Use cases:
          – In-store push offers.
          – Count users / devices within a radius.
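A minimal sketch of how the rectangular and radius lookups above might be expressed as Druid native spatial filters, built here as Python dictionaries. The data source name `store_visits`, the dimension name `coordinates`, and the helper function names are hypothetical; the `count` aggregator counts matching rows, so counting distinct devices would need a different aggregator.

```python
def radius_filter(dimension, lat, lon, radius):
    """Spatial filter matching points within `radius` of (lat, lon)."""
    return {
        "type": "spatial",
        "dimension": dimension,
        "bound": {"type": "radius", "coords": [lat, lon], "radius": radius},
    }


def rectangular_filter(dimension, min_coords, max_coords):
    """Spatial filter matching points inside an axis-aligned rectangle."""
    return {
        "type": "spatial",
        "dimension": dimension,
        "bound": {
            "type": "rectangular",
            "minCoords": list(min_coords),
            "maxCoords": list(max_coords),
        },
    }


def count_rows_query(datasource, interval, spatial_filter):
    """Timeseries query counting the rows that pass the spatial filter."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,  # hypothetical data source name
        "granularity": "all",
        "intervals": [interval],
        "filter": spatial_filter,
        "aggregations": [{"type": "count", "name": "rows"}],
    }


# Example: rows near a (hypothetical) store location on one day.
near_store = radius_filter("coordinates", 37.77, -122.42, 0.05)
query = count_rows_query("store_visits", "2017-09-01/2017-09-02", near_store)
```

The radius is expressed in the same units as the stored coordinates, so a real deployment would need to pick it accordingly.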
  • 9. Druid: Fast Facts
      – Most events per day: 30 billion events / day (Metamarkets)
      – Most computed metrics: 2 billion metrics / min (Jolata)
      – Largest cluster: 200 nodes (Metamarkets)
      – Largest hourly ingestion: 3 TB per hour (Netflix)
  • 10. Druid Architecture
  • 11. Druid: Services
      – Broker
      – Coordinator
      – Historical
      – Realtime
      – Router
      – Overlord, Middle Managers and Peons
      – Different node types for solving different problems
      – Processes dedicated to:
          – Historical data
          – Ingestion
          – Coordination
          – Result merging
  • 12. A Typical Druid Deployment
    [Diagram labels: many nodes, based on data size and number of queries; HDFS or S3; Superset or dashboards, BI tools.]
  • 13. Druid: Segments
      – Data in Druid is stored in segment files.
      – Segments are partitioned by time, which supports fast time-based slice-and-dice.
      – Ideally, segment files are each smaller than 1 GB.
      – If files are large, smaller time partitions are needed.
    [Diagram: a time axis split into Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5: Friday.]
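The time partitioning described on this slide can be illustrated with a small sketch. It is not Druid's implementation: it only shows why bucketing events by day (or by hour, when daily files grow too large) lets a time-range query touch only the relevant buckets. The event layout and function names are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def segment_key(ts_millis, granularity="day"):
    """Map an event timestamp (epoch millis, UTC) to its time-bucket key."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    if granularity == "hour":
        return dt.strftime("%Y-%m-%dT%H")
    return dt.strftime("%Y-%m-%d")


def partition_events(events, granularity="day"):
    """Group events into per-interval buckets, one bucket per segment."""
    segments = defaultdict(list)
    for event in events:
        segments[segment_key(event["timestamp"], granularity)].append(event)
    return dict(segments)


def segments_for_range(segments, start_key, end_key):
    """Return the keys of segments overlapping [start_key, end_key).

    Keys are ISO-formatted, so lexicographic order is chronological order.
    """
    return [key for key in sorted(segments) if start_key <= key < end_key]
```

A query for Tuesday only needs the Tuesday bucket; switching `granularity` to `"hour"` yields more, smaller buckets for the same data.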
  • 14. Historical Nodes
      – Main workhorses of a Druid cluster
      – Load immutable, read-optimized segments
      – Respond to queries
      – Use memory-mapped files to load segments
  • 15. Broker Nodes
      – Keep track of segment announcements in the cluster
          – (This information is kept in ZooKeeper, much as Storm or HBase do.)
      – Scatter each query across historical and realtime nodes
          – (Clients issue queries to this node, but queries are processed elsewhere.)
      – Merge results from the different query nodes
      – Provide a (distributed) caching layer
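The scatter and merge steps on this slide can be sketched as follows. This is a toy model under stated assumptions, not the broker's code: each "node" is just an in-memory list of rows, the partial result is a per-dimension sum, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def query_node(node_rows, dimension, metric):
    """Compute one node's partial result: sum of `metric` per `dimension` value."""
    partial = Counter()
    for row in node_rows:
        partial[row[dimension]] += row[metric]
    return partial


def broker_query(nodes, dimension, metric):
    """Scatter the query to every node in parallel, then merge the partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda rows: query_node(rows, dimension, metric), nodes)
        )
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return dict(merged)
```

Sums merge cleanly because they are associative; the same pattern applies to counts, minimums, and maximums.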
  • 16. Coordinator Nodes
      – Assign segments to historical nodes
      – Use an interval-based cost function to distribute segments
      – Make sure query load is uniform across historical nodes
      – Handle replication of data
      – Apply configurable rules to load/drop data
  • 17. Druid: Segment Data Structures
      – Within a segment:
          – Timestamp column group.
          – Dimensions column group.
          – Metrics column group.
          – Indexes that facilitate fast lookup and aggregation.
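A rough sketch of the segment layout listed above: separate timestamp, dimension, and metric columns, plus a per-value index used to filter before aggregating. Python integers stand in for compressed bitmaps here, and the row layout and function names are assumptions for illustration only.

```python
def build_segment(rows, dimensions, metrics):
    """Store rows column by column and build a bitmap index per dimension value."""
    segment = {
        "__time": [row["__time"] for row in rows],
        "dimensions": {d: [row[d] for row in rows] for d in dimensions},
        "metrics": {m: [row[m] for row in rows] for m in metrics},
        "index": {d: {} for d in dimensions},
    }
    for row_id, row in enumerate(rows):
        for d in dimensions:
            bitmap = segment["index"][d].get(row[d], 0)
            # Bit `row_id` is set when that row holds this dimension value.
            segment["index"][d][row[d]] = bitmap | (1 << row_id)
    return segment


def sum_where(segment, metric, **equals):
    """Sum a metric over rows matching every dimension=value condition."""
    row_count = len(segment["__time"])
    bitmap = (1 << row_count) - 1  # start with all rows selected
    for dim, value in equals.items():
        bitmap &= segment["index"][dim].get(value, 0)  # AND the per-value bitmaps
    values = segment["metrics"][metric]
    return sum(values[i] for i in range(row_count) if (bitmap >> i) & 1)
```

Filtering becomes bitwise AND over small index entries, and only the one metric column being aggregated is read.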
  • 18. What Makes Druid Fast?
  • 19. What Makes Druid Fast?
  • 20. Druid High Level Architecture
    [Diagram labels: Streaming Data; ETL (Samza, Kafka, Storm, Spark, etc.); Realtime Nodes; Handoff; Batch Data; Hadoop; Historical Nodes; Broker Node; Queries.]
  • 21. Druid: Batch Indexing
      – Indexing is performed by related components:
          – Overlord
          – Middle Managers
          – Peons
      – Batch indexing is done on data that already exists in deep storage (e.g. HDFS).
      – Middle Managers spawn peons, which run ingestion tasks.
      – Middle Managers receive task definitions that specify which tasks to run and their properties.
      – Each peon runs one task.
  • 22. Stream Ingestion: Real-time Index Tasks
      – Ability to ingest streams of data
      – Stores data in a write-optimized structure
      – Periodically converts the write-optimized structure to read-optimized segments
      – Events are queryable as soon as they are ingested
      – Both push-based (Tranquility) and pull-based ingestion
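The write-optimized to read-optimized conversion described on this slide can be sketched with a toy class. It is only an illustration of the idea: the class name, the row-count threshold, and the use of sorted tuples as "read-optimized" data are all assumptions, not how Druid stores data.

```python
class RealtimeTask:
    """Toy model: append-only buffer that is periodically frozen into segments."""

    def __init__(self, max_rows_in_memory=3):
        self.max_rows_in_memory = max_rows_in_memory
        self.buffer = []    # write-optimized: cheap appends, arrival order
        self.segments = []  # read-optimized: immutable, sorted by timestamp

    def ingest(self, event):
        """Append an event; persist the buffer once it reaches the threshold."""
        self.buffer.append(event)
        if len(self.buffer) >= self.max_rows_in_memory:
            self.persist()

    def persist(self):
        """Convert the current buffer into an immutable, time-sorted segment."""
        if self.buffer:
            ordered = sorted(self.buffer, key=lambda e: e["timestamp"])
            self.segments.append(tuple(ordered))
            self.buffer = []

    def count(self):
        """Queries see both persisted segments and not-yet-persisted events."""
        return len(self.buffer) + sum(len(s) for s in self.segments)
```

Because `count` reads the buffer as well as the segments, an event is visible to queries immediately after `ingest` returns, before any persist happens.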
  • 23. Druid: Realtime Indexing
    [Diagram labels: Spark, Flink, Storm, Python; Tranquility (push); Kafka (pull); Indexing Service with Overlord, MiddleManager and Peons (task); ZooKeeper; Coordinator; Broker; Deep Storage (push segments); Historical nodes with segment cache (segment).]
  • 24. Data model and queries
  • 25. Druid: Segment Data Structures
      – Within a segment:
          – Timestamp column group.
          – Dimensions column group.
          – Metrics column group.
          – Indexes that facilitate fast lookup and aggregation.
  • 26. Queries
      – Timeseries: time-based queries
      – TopN: equivalent to a GroupBy plus ordering over one dimension; results are approximate if the dimension has more than 1000 values
      – GroupBy
      – Time boundary: returns the earliest and latest data points of a data set
      – Search queries / Select
      – Each query can use operators such as:
          – Granularity (roll-up)
          – Filters
          – Aggregation / post-aggregation
          – Etc.
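As an illustration of two of the query types listed above, the following sketch builds Timeseries and TopN native queries as Python dictionaries and serializes them to JSON. The data source `wikiticker` and the columns `page` and `c_added` are borrowed from the Hive slides later in this deck; the helper names, intervals, and output names are assumptions.

```python
import json


def timeseries_query(datasource, interval, granularity, aggregations, flt=None):
    """Aggregate metrics per time bucket, optionally restricted by a filter."""
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,  # roll-up bucket, e.g. "hour" or "day"
        "intervals": [interval],
        "aggregations": aggregations,
    }
    if flt is not None:
        query["filter"] = flt
    return query


def topn_query(datasource, interval, dimension, metric, threshold, aggregations):
    """Return the top `threshold` values of one dimension, ranked by `metric`."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "dimension": dimension,
        "metric": metric,
        "threshold": threshold,
        "aggregations": aggregations,
    }


added = [{"type": "longSum", "name": "added", "fieldName": "c_added"}]
hourly = timeseries_query(
    "wikiticker", "2017-09-01/2017-09-02", "hour", added,
    flt={"type": "selector", "dimension": "page", "value": "Druid"},
)
top_pages = topn_query("wikiticker", "2017-09-01/2017-09-02", "page", "added", 10, added)
payload = json.dumps(top_pages, indent=2)  # JSON body to send to a broker
```

Such a payload would typically be sent as an HTTP POST to a broker's native query endpoint; the host and port depend on the deployment.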
  • 27. Group-by example
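The query shown on this slide is not preserved in the transcript. As a stand-in, here is a hedged sketch of what a GroupBy native query might look like, built as a Python dictionary. It is not the slide's original example: the data source and column names are taken from the Hive slides in this deck, and the helper name and interval are hypothetical.

```python
def group_by_query(datasource, interval, dimensions, aggregations,
                   granularity="all", limit=None, order_by=None):
    """Group rows by several dimensions and aggregate metrics per group."""
    query = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,
        "intervals": [interval],
        "dimensions": list(dimensions),
        "aggregations": aggregations,
    }
    if limit is not None:
        columns = []
        if order_by is not None:
            # Order groups by an output column, largest first.
            columns.append({"dimension": order_by, "direction": "descending"})
        query["limitSpec"] = {"type": "default", "limit": limit, "columns": columns}
    return query


edits_by_page_and_user = group_by_query(
    "wikiticker",
    "2017-09-01/2017-09-02",
    ["page", "user"],
    [
        {"type": "longSum", "name": "added", "fieldName": "c_added"},
        {"type": "longSum", "name": "removed", "fieldName": "c_removed"},
    ],
    limit=5,
    order_by="added",
)
```

Unlike TopN, this form groups on more than one dimension at once, at the cost of doing more work per query.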
  • 28. Results example
  • 29. Druid And Hive
  • 30. Druid Hive integration
      – Query Druid from Hive with SQL (data already existing in Druid):
          – Druid has its own JSON-based query language
          – No native BI tool integration
          – Point Hive to the broker and specify the data source name
          – Use Hive as a virtualization layer
          – Query Druid data with SQL and plug in any BI tool
      – Hive query acceleration (data already existing in Hive):
          – Data is stored in a distributed filesystem such as HDFS or S3, in a format that Hive can read, e.g. TSV, CSV, ORC, Parquet
          – Perform some pre-processing over various data sources before feeding the data to Druid
          – Accelerate queries over Hive data
          – Join between hot and cold data
  • 31. Query Druid from Hive with SQL
      – Point Hive to the broker:
          SET hive.druid.broker.address.default=druid.broker.hostname:8082;
      – Simple CREATE EXTERNAL TABLE statement:
          CREATE EXTERNAL TABLE druid_wikiticker
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker");
        Here druid_wikiticker is the Hive table name, org.apache.hadoop.hive.druid.DruidStorageHandler is the Hive storage handler class name, and wikiticker is the Druid data source name.
      – The broker node endpoint is specified as a Hive configuration parameter
      – Automatic Druid data schema discovery: segment metadata query
  • 32. Query Druid from Hive
    [Screenshots: Hive table creation; Hive query plan; query with SQL.]
  • 33. Hive queries acceleration
      – Use a Create Table As Select (CTAS) statement:
          CREATE TABLE druid_wikiticker
          STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "HOUR")
          AS SELECT __time, page, user, c_added, c_removed FROM src;
        Here druid_wikiticker is the Hive table name, org.apache.hadoop.hive.druid.DruidStorageHandler is the Hive storage handler class name, and wikiticker is the Druid data source name.
      – Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type:
          – Timestamp -> __time
          – Dimensions -> page, user
          – Metrics -> c_added, c_removed
    Credit: jcamacho@apache.org
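The column-type inference mentioned on this slide can be sketched as a small classification function. This is an assumption-laden illustration, not Hive's code: it assumes the `__time` column must be a timestamp, string-like columns become dimensions, and numeric columns become metrics. The Hive types given for `src` below are also assumed, since the slide does not state them.

```python
STRING_TYPES = {"string", "varchar", "char"}
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def infer_druid_roles(hive_columns):
    """Classify (name, hive_type) pairs into timestamp, dimensions, and metrics."""
    roles = {"timestamp": None, "dimensions": [], "metrics": []}
    for name, hive_type in hive_columns:
        base = hive_type.lower().split("(")[0].strip()  # "varchar(20)" -> "varchar"
        if name == "__time":
            if not base.startswith("timestamp"):
                raise ValueError("__time must be a timestamp column")
            roles["timestamp"] = name
        elif base in STRING_TYPES:
            roles["dimensions"].append(name)
        elif base in NUMERIC_TYPES:
            roles["metrics"].append(name)
        else:
            raise ValueError(f"unsupported Hive type for {name}: {hive_type}")
    if roles["timestamp"] is None:
        raise ValueError("a __time timestamp column is required")
    return roles


# Assumed Hive types for the columns selected in the CTAS statement.
src_columns = [
    ("__time", "timestamp"),
    ("page", "string"),
    ("user", "string"),
    ("c_added", "int"),
    ("c_removed", "int"),
]
roles = infer_druid_roles(src_columns)
```

With the assumed types, the result matches the mapping listed on the slide: `page` and `user` as dimensions, `c_added` and `c_removed` as metrics.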
  • 34. Index Hive in Druid
    [Screenshots: SQL and the resulting Druid query; SQL and the resulting Hive query.]
  • 35. Example of Druid queries: Cryptocurrency Market Data
  • 36. Example of Druid queries
    [Screenshots: Druid query; results in shell; results in Superset.]
  • 37. Superset
  • 38. Demo