1 © Hortonworks Inc. 2011–2018. All rights reserved
An Introduction to Druid
Nishant Bangarwa
Software Developer
Agenda
History and Motivation
Introduction
Data Storage Format
Druid Architecture – Indexing and Querying Data
Druid In Production
Recent Improvements
HISTORY
• Druid open sourced in late 2012
• Initial use case: power an ad-tech analytics product
• Requirements
  • Query any combination of metrics and dimensions
  • Scalability: trillions of events/day
  • Real-time: data freshness via streaming ingestion
  • Interactive: low-latency queries
How big was the initial use case?
MOTIVATION
• Business intelligence queries
• Arbitrary slicing and dicing of data
• Interactive real-time visualizations on complex data streams
• Answer BI questions such as:
  • How many unique male visitors visited my website last month?
  • How many products were sold last quarter, broken down by demographic and product category?
• Not interested in dumping the entire dataset
Introduction
What is Druid?
• Column-oriented distributed datastore
• Sub-second query times
• Real-time streaming ingestion
• Arbitrary slicing and dicing of data
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
• Scalable to petabytes of data
• Highly available
Companies Using Druid
Druid Architecture
Node Types
• Realtime Nodes
• Historical Nodes
• Broker Nodes
• Coordinator Nodes
Druid Architecture
[Diagram: streaming data is ingested by realtime index tasks, which periodically hand segments off to historical nodes; batch data is indexed directly into historical nodes; broker nodes receive queries and fan them out to both realtime tasks and historical nodes.]
Druid Architecture
[Diagram: the full architecture adds coordinator nodes, which manage segment placement on historical nodes, a metadata store holding segment metadata, and ZooKeeper for coordination; queries enter through broker nodes while streaming and batch data are ingested as before.]
Storage Format
Druid: Segments
• Data in Druid is stored in Segment Files.
• Partitioned by time
• Ideally, segment files are each smaller than 1GB.
• If files are large, smaller time partitions are needed.
[Diagram: segments laid out along the time axis — Segment 1 (Monday), Segment 2 (Tuesday), Segment 3 (Wednesday), Segment 4 (Thursday), and two shards, Segment 5_1 and Segment 5_2, sharing Friday's interval.]
Example Wikipedia Edit Dataset
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
Timestamp Dimensions Metrics
Data Rollup
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
timestamp page language city country count sum_added sum_deleted min_added max_added ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 32
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 43
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 12
Rollup by hour
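The hourly rollup above can be reproduced with plain Python (illustrative only; Druid performs this aggregation at ingestion time):

```python
from collections import defaultdict
from datetime import datetime

# Raw events: (timestamp, page, language, city, country, added, deleted)
events = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:03:53Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T00:04:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T00:05:35Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T00:06:41Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-02T00:08:35Z", "Selena Gomes", "en", "Calgary", "CA", 12, 53),
]

def truncate_to_hour(ts: str) -> str:
    """Floor an ISO-8601 timestamp to the start of its hour."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return dt.replace(minute=0, second=0).strftime("%Y-%m-%dT%H:%M:%SZ")

# Rows collapse into one entry per (truncated timestamp, dimensions) key,
# keeping only the aggregated metrics.
rollup = defaultdict(lambda: {"count": 0, "sum_added": 0, "sum_deleted": 0,
                              "min_added": float("inf"),
                              "max_added": float("-inf")})
for ts, page, lang, city, country, added, deleted in events:
    agg = rollup[(truncate_to_hour(ts), page, lang, city, country)]
    agg["count"] += 1
    agg["sum_added"] += added
    agg["sum_deleted"] += deleted
    agg["min_added"] = min(agg["min_added"], added)
    agg["max_added"] = max(agg["max_added"], added)
```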
Dictionary Encoding
• Create and store Ids for each value
• e.g. page column
⬢ Values - Justin Bieber, Ke$ha, Selena Gomes
⬢ Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2
⬢ Column Data - [0 0 0 1 1 2]
• city column - [0 0 0 1 1 1]
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
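The id assignment above can be sketched in a few lines of Python (illustrative, not Druid's encoder):

```python
def dictionary_encode(values):
    """Assign each distinct value an integer id in order of first appearance,
    and return both the dictionary and the encoded column."""
    ids, encoded = {}, []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        encoded.append(ids[v])
    return ids, encoded

# The page column from the example table above.
pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
ids, column = dictionary_encode(pages)
```

Storing small integers instead of repeated strings is what makes the column compact and fast to scan.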
Bitmap Indices
• Store Bitmap Indices for each value
⬢ Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
⬢ Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
⬢ Selena Gomes -> [5] -> [0 0 0 0 0 1]
• Queries
⬢ Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
⬢ language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
• Indexes compressed with Concise or Roaring encoding
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
Approximate Sketch Columns
timestamp page userid language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber user1111111 en SF USA 10 65
2011-01-01T00:03:53Z Justin Bieber user1111111 en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber user2222222 en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha user3333333 en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha user4444444 en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes user1111111 en Calgary CA 12 53
timestamp page language city country count sum_added sum_deleted min_added userid_sketch ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 {sketch}
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 {sketch}
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 {sketch}
Rollup by hour
Approximate Sketch Columns
• Better rollup for high-cardinality columns, e.g. userid
• Reduced storage size
• Use cases
  • Fast approximate distinct counts
  • Approximate histograms
  • Funnel/retention analysis
• Limitations
  • Exact counts are not possible
  • Cannot filter on individual row values
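A minimal HyperLogLog sketch shows why distinct counts stay cheap while the limitations above hold: only per-register maxima survive, never the row values themselves. This is an illustrative toy, not the sketch implementation Druid ships:

```python
import hashlib
import math

class HyperLogLog:
    """A minimal HyperLogLog distinct-count sketch (illustrative only)."""

    def __init__(self, p=14):
        self.p = p            # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8],
                           "big")                     # 64-bit hash
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) fix
            return self.m * math.log(self.m / zeros)
        return raw

sketch = HyperLogLog()
for i in range(10000):
    sketch.add(f"user{i}")
    sketch.add(f"user{i}")  # re-adding a value does not change the estimate
unique_users = sketch.estimate()
```

16,384 one-byte registers estimate 10,000 distinct users to within a few percent, regardless of how many duplicate rows were ingested.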
Indexing Data
Indexing Service
• Indexing is performed by
  • Overlord
  • Middle Managers
  • Peons
• Middle Managers spawn peons, which run ingestion tasks
• Each peon runs one task at a time
• A task definition specifies which task to run and its properties
Streaming Ingestion : Realtime Index Tasks
• Ability to ingest streams of data
• Stores data in a write-optimized structure
• Periodically converts the write-optimized structure into read-optimized segments
• Events are queryable as soon as they are ingested
• Both push- and pull-based ingestion
Streaming Ingestion : Tranquility
• Helper library for coordinating streaming ingestion
• Simple API to send events to Druid
• Transparently manages
  • Realtime index task creation
  • Partitioning and replication
  • Schema evolution
• Can be used with your favourite ETL framework, e.g. Flink, NiFi, Samza, Spark, Storm
• At-least-once ingestion
Kafka Indexing Service (experimental)
• Supports exactly-once ingestion
• Messages are pulled by Kafka index tasks
• Each Kafka index task consumes from a set of partitions with specific start and end offsets
• Each message is verified to ensure ordering
• Kafka offsets and corresponding segments are persisted atomically in the same metadata transaction
• Kafka supervisor
  • Embedded inside the overlord
  • Manages Kafka index tasks
  • Retries failed tasks
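The atomic offset-plus-segment commit can be illustrated with a toy metadata store in SQLite (an illustrative sketch, not Druid's actual metadata schema; the table and column names are invented):

```python
import sqlite3

# Committing the published segment and the consumed Kafka offset in one
# transaction means a crash can never leave one updated without the other --
# the basis of exactly-once handoff.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id TEXT PRIMARY KEY, interval TEXT)")
conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")

def publish(conn, segment_id, interval, part, next_offset):
    with conn:  # single transaction: both rows commit, or neither does
        conn.execute("INSERT INTO segments VALUES (?, ?)",
                     (segment_id, interval))
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                     (part, next_offset))

publish(conn, "wikipedia_2011-01-01", "2011-01-01/2011-01-02", 0, 42)
```

On restart, a task reads `next_offset` back and resumes exactly where the last committed segment ended.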
Batch Ingestion
• HadoopIndexTask
  • Peon launches a Hadoop MR job
  • Mappers read data
  • Reducers create Druid segment files
• IndexTask
  • Runs in a single JVM, i.e. the peon
  • Suitable for small data sizes (<1 GB)
• Integrations with Apache Hive and Spark for batch ingestion
Querying Data
Querying Data from Druid
• Druid supports
  • JSON queries over HTTP
  • Built-in SQL (experimental)
• Querying libraries available for
  • Python
  • R
  • Ruby
  • JavaScript
  • Clojure
  • PHP
• Multiple open source UI tools
JSON Over HTTP
• HTTP REST API
• Queries and results expressed in JSON
• Multiple Query Types
• Time Boundary
• Timeseries
• TopN
• GroupBy
• Select
• Segment Metadata
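A timeseries query can be built as a plain Python dict and serialized to JSON. The datasource, columns, and filter here are illustrative; a real query would be POSTed to the broker's query endpoint:

```python
import json

# Hourly edit counts for USA rows over two days -- field names assume the
# rolled-up Wikipedia schema from the earlier slides.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2011-01-01/2011-01-03"],
    "filter": {"type": "selector", "dimension": "country", "value": "USA"},
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"},
        {"type": "longSum", "name": "added", "fieldName": "sum_added"},
    ],
}
payload = json.dumps(query)  # body of the HTTP POST to the broker
```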
Built-in SQL (experimental)
• Apache Calcite based parser and planner
• Ability to connect Druid to any BI tool that supports JDBC
• SQL via JSON over HTTP
• Supports approximate queries
  • APPROX_COUNT_DISTINCT(col)
• Ability to do fast approximate TopN queries
• APPROX_QUANTILE(column, probability)
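The approximate functions above compose like ordinary SQL aggregates. A sketch of such a query and its JSON request body (the table and column names are assumptions, as is the exact request shape):

```python
import json

# Distinct users and a 95th-percentile metric per country, using the
# approximate aggregates from the bullet list above.
sql = """
SELECT country,
       APPROX_COUNT_DISTINCT(userid_sketch) AS unique_users,
       APPROX_QUANTILE(added, 0.95) AS p95_added
FROM wikipedia
WHERE __time >= TIMESTAMP '2011-01-01 00:00:00'
GROUP BY country
"""
request_body = json.dumps({"query": sql})  # sent as SQL-over-JSON-over-HTTP
```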
Integrated with multiple Open Source UI tools
• Superset
  • Developed at Airbnb
  • In Apache incubation since May 2017
• Grafana: Druid plugin
• Metabase
• With built-in SQL, connect with any BI tool supporting JDBC
Druid in Production
Druid in Production
• Is Druid suitable for my use case?
• Will Druid meet my performance requirements at scale?
• How complex is it to operate and manage a Druid cluster?
• How do I monitor a Druid cluster?
• High availability?
• How do I upgrade a Druid cluster without downtime?
• Security?
Suitable Use Cases
• Powering interactive, user-facing applications
• Arbitrary slicing and dicing of large datasets
• User behavior analysis
• measuring distinct counts
• retention analysis
• funnel analysis
• A/B testing
• Exploratory analytics/root cause analysis
• Not interested in dumping entire dataset
Performance and Scalability: Fast Facts
• Most events per day: 300 billion events/day (Metamarkets)
• Most computed metrics: 1 billion metrics/min (Jolata)
• Largest cluster: 200 nodes (Metamarkets)
• Largest hourly ingestion: 2 TB per hour (Netflix)
Performance Numbers
• Query Latency
• average - 500ms
• 90%ile < 1sec
• 95%ile < 5sec
• 99%ile < 10 sec
• Query Volume
• 1000s queries per minute
• Benchmarking code: https://github.com/druid-io/druid-benchmark
Simplified Druid Cluster Management with Ambari
• Install, configure, and manage Druid and all external dependencies from Ambari
• Easy to enable HA, security, monitoring, and more
Monitoring a Druid Cluster
• Each Druid Node emits metrics for
• Query performance
• Ingestion Rate
• JVM Health
• Query Cache performance
• System health
• Emitted as JSON objects to a runtime log file or over HTTP to other services
• Emitters available for Ambari Metrics Server, Graphite, StatsD, Kafka
• Easy to implement your own metrics emitter
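A custom emitter boils down to serializing metric events as JSON lines. The field set below is illustrative, in the spirit of Druid's JSON emitter output rather than its precise schema:

```python
import json
import time

def make_metric_event(metric, value, service, host, **dimensions):
    """Build one metric event as a JSON line, ready to be written to a log
    file or POSTed to a metrics backend. Field names are illustrative."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime()),
        "metric": metric,
        "value": value,
        "service": service,
        "host": host,
    }
    event.update(dimensions)  # e.g. per-datasource dimensions
    return json.dumps(event)

line = make_metric_event("query/time", 37, "druid/broker", "broker1:8082",
                         dataSource="wikipedia")
```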
Monitoring using Ambari Metrics Server
• HDP 2.6.1 contains pre-defined Grafana dashboards for
  • Health of Druid nodes
  • Ingestion
  • Query performance
• Easy to create new dashboards and set up alerts
• Auto-configured when both Druid and Ambari Metrics Server are installed
High Availability
• Deploy coordinator/overlord on multiple instances; leader election via ZooKeeper
• Broker: install multiple brokers
  • Use the Druid router or any load balancer to route queries to brokers
• Realtime index tasks: create redundant tasks
• Historical nodes: create load rules with replication factor >= 2 (default = 2)
Rolling Upgrades
• Shared-nothing architecture
  • Maintains backwards compatibility
  • Data redundancy
• Upgrade one Druid component at a time, with no downtime
Security
• Supports authentication via Kerberos/SPNEGO
• Easy wizard-based Kerberos enablement via Ambari
[Diagram: the user runs kinit against the KDC server (1), then the browser presents the resulting token to Druid (2).]
Summary
• Easy installation and management via Ambari
• Real-time
  • Ingestion latency: seconds
  • Query latency: sub-second to seconds
• Arbitrarily slice and dice big data like a ninja
  • No more pre-canned drill-downs
  • Query at finer-grained granularity
• High availability and rolling-deployment capabilities
• Secure and production ready
• Vibrant and active community
• Available as Tech Preview in HDP 2.6.1
Useful Resources
• Druid website: http://druid.io
• Druid user group: users@druid.incubator.apache.org
• Druid dev group: dev@druid.incubator.apache.org
Thank you
Twitter - @NishantBangarwa
Email - nbangarwa@hortonworks.com
Questions?
Extending Core Druid
• Plugin-based architecture
  • Leverages Guice to load extensions at runtime
• Extension points
  • Add a new deep storage implementation
  • Add a new firehose for ingestion
  • Add aggregators
  • Add complex metrics
  • Add new query types
  • Add new Jersey resources
• Bundle your extension with all the other Druid extensions
Performance : Approximate Algorithms
• Ability to store approximate data sketches for high-cardinality columns, e.g. userid
• Reduced storage size
• Use cases
  • Fast approximate distinct counts
  • Approximate top-K queries
  • Approximate histograms
  • Funnel/retention analysis
• Limitations
  • Exact counts are not possible
  • Cannot filter on individual row values
Superset
• Python backend
  • Flask App Builder
  • Authentication
  • Pandas for rich analytics
  • SQLAlchemy as the SQL toolkit
• JavaScript frontend
  • React, NVD3
• Deep integration with Druid
Superset Rich Dashboarding Capabilities: Treemaps
Superset Rich Dashboarding Capabilities: Sunburst
Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 

Recently uploaded (20)

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 

An Introduction to Druid

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved An Introduction to Druid Nishant Bangarwa Software Developer
  • 2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved 2 Agenda History and Motivation Introduction Data Storage Format Druid Architecture – Indexing and Querying Data Druid In Production Recent Improvements
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved HISTORY • Druid Open sourced in late 2012 • Initial Use case • Power ad-tech analytics product • Requirements • Query any combination of metrics and dimensions • Scalability : trillions of events/day • Real-time : data freshness • Streaming Ingestion • Interactive : low latency queries
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved 4 How Big is the initial use case ?
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved 5 MOTIVATION • Business Intelligence Queries • Arbitrary slicing and dicing of data • Interactive real time visualizations on Complex data streams • Answer BI questions • How many unique male visitors visited my website last month ? • How many products were sold last quarter broken down by a demographic and product category ? • Not interested in dumping entire dataset
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Introduction
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved 7 What is Druid ? • Column-oriented distributed datastore • Sub-Second query times • Realtime streaming ingestion • Arbitrary slicing and dicing of data • Automatic Data Summarization • Approximate algorithms (hyperLogLog, theta) • Scalable to petabytes of data • Highly available
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved 8 Companies Using Druid
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Druid Architecture
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved 1 Node Types • Realtime Nodes • Historical Nodes • Broker Nodes • Coordinator Nodes
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Realtime Nodes Historical Nodes 1 Druid Architecture Batch Data Event Historical Nodes Broker Nodes Realtime Index Tasks Streaming Data Historical Nodes Handoff
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved Druid Architecture Batch Data Queries Metadata Store Coordinator Nodes Zookeeper Historical Nodes Broker Nodes Realtime Index Tasks Streaming Data Handoff
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Storage Format
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Druid: Segments • Data in Druid is stored in Segment Files. • Partitioned by time • Ideally, segment files are each smaller than 1GB. • If files are large, smaller time partitions are needed. Time Segment 1: Monday Segment 2: Tuesday Segment 3: Wednesday Segment 4: Thursday Segment 5_2: Friday Segment 5_1: Friday
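The time-partitioning idea behind segments can be sketched in a few lines of Python. This is an illustration only, not Druid's implementation; the function name `partition_by_day` is invented for the example:

```python
from collections import defaultdict
from datetime import datetime

def partition_by_day(events):
    """Group events into per-day buckets, mirroring how Druid
    partitions segment files by time interval."""
    segments = defaultdict(list)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        segments[ts.date().isoformat()].append(event)
    return dict(segments)

events = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:05:35Z", "page": "Ke$ha"},
    {"timestamp": "2011-01-02T00:08:35Z", "page": "Selena Gomes"},
]
segments = partition_by_day(events)
# Two daily "segments": 2011-01-01 holds two events, 2011-01-02 holds one.
```

In the real system, each such time bucket becomes one or more immutable segment files; if a single day's file would exceed roughly 1 GB, Druid shards it (as with Segment 5_1 and 5_2 in the diagram).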
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved 1 Example Wikipedia Edit Dataset timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53 Timestamp Dimensions Metrics
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved 1 Data Rollup timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53 timestamp page language city country count sum_added sum_deleted min_added max_added …. 2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 32 2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 43 2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 12 Rollup by hour
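The hourly rollup shown in the table can be sketched as a simple group-and-aggregate. This is a toy model of ingestion-time rollup, not Druid code; the column names match the Wikipedia-edits example above:

```python
from collections import defaultdict

def rollup_by_hour(rows):
    """Collapse raw rows into one row per (hour, page, city),
    accumulating count and sum metrics as Druid's rollup does."""
    grouped = defaultdict(lambda: {"count": 0, "sum_added": 0, "sum_deleted": 0})
    for r in rows:
        hour = r["timestamp"][:13] + ":00:00Z"   # truncate timestamp to the hour
        key = (hour, r["page"], r["city"])
        g = grouped[key]
        g["count"] += 1
        g["sum_added"] += r["added"]
        g["sum_deleted"] += r["deleted"]
    return dict(grouped)

rows = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber", "city": "SF", "added": 10, "deleted": 65},
    {"timestamp": "2011-01-01T00:03:45Z", "page": "Justin Bieber", "city": "SF", "added": 15, "deleted": 62},
    {"timestamp": "2011-01-01T00:05:35Z", "page": "Ke$ha", "city": "Calgary", "added": 17, "deleted": 87},
]
result = rollup_by_hour(rows)
# The two Justin Bieber rows collapse into one: count=2, sum_added=25, sum_deleted=127.
```

Rollup trades away individual events for a much smaller, pre-aggregated table, which is why min/max aggregates must also be computed at ingestion time if they are needed later.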
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved 1 Dictionary Encoding • Create and store Ids for each value • e.g. page column ⬢ Values - Justin Bieber, Ke$ha, Selena Gomes ⬢ Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2 ⬢ Column Data - [0 0 0 1 1 2] • city column - [0 0 0 1 1 1] timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
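The dictionary-encoding step above can be reproduced in a few lines of Python (illustrative only; Druid's actual columnar format is more involved):

```python
def dictionary_encode(values):
    """Assign each distinct value an integer id in first-seen order,
    and return (dictionary, encoded column)."""
    ids = {}
    column = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        column.append(ids[v])
    return ids, column

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
ids, column = dictionary_encode(pages)
# ids    == {"Justin Bieber": 0, "Ke$ha": 1, "Selena Gomes": 2}
# column == [0, 0, 0, 1, 1, 2]
```

Storing small integers instead of repeated strings is what makes the column both compact and cheap to scan.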
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved 1 Bitmap Indices • Store Bitmap Indices for each value ⬢ Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0] ⬢ Ke$ha -> [3, 4] -> [0 0 0 1 1 0] ⬢ Selena Gomes -> [5] -> [0 0 0 0 0 1] • Queries ⬢ Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0] ⬢ language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1] • Indexes compressed with Concise or Roaring encoding timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:01:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:01:35Z Ke$ha en Calgary CA 43 99 2011-01-01T00:01:35Z Selena Gomes en Calgary CA 12 53
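The bitmap-index filtering above can be sketched with plain Python lists of 0/1 (real deployments use compressed bitmaps such as Concise or Roaring, but the OR/AND logic is the same):

```python
def bitmap_index(column):
    """Build one bitmap (list of 0/1, one bit per row) per distinct value."""
    n = len(column)
    index = {}
    for i, v in enumerate(column):
        index.setdefault(v, [0] * n)[i] = 1
    return index

def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

pages = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
idx = bitmap_index(pages)
# Filter: page = 'Justin Bieber' OR page = 'Ke$ha'
matches = bitmap_or(idx["Justin Bieber"], idx["Ke$ha"])  # [1, 1, 1, 1, 1, 0]
```

Because filters reduce to bitwise operations, arbitrary AND/OR combinations of dimension predicates stay fast regardless of how many rows match.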
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved 1 Approximate Sketch Columns timestamp page userid language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber user1111111 en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber user1111111 en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber user2222222 en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha user3333333 en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha user4444444 en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes user1111111 en Calgary CA 12 53 timestamp page language city country count sum_added sum_delete d min_added Userid_sket ch …. 2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 {sketch} 2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 {sketch} 2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 {sketch} Rollup by hour
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved Approximate Sketch Columns • Better rollup for high cardinality columns e.g userid • Reduced storage size • Use Cases • Fast approximate distinct counts • Approximate histograms • Funnel/retention analysis • Limitation • Not possible to do exact counts • filter on individual row values
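To see why a small sketch can stand in for a high-cardinality column like userid, here is a minimal K-Minimum-Values (KMV) cardinality estimator. This is a teaching sketch, not the HyperLogLog or theta sketches Druid actually ships; MD5 is used only as a convenient, deterministic hash:

```python
import hashlib

def kmv_estimate(values, k=64):
    """Keep the k smallest normalized hash values of the distinct inputs;
    the k-th smallest, h_k, yields the estimate (k - 1) / h_k."""
    hashes = set()
    for v in values:
        h = int(hashlib.md5(v.encode()).hexdigest(), 16) / float(2 ** 128)
        hashes.add(h)                 # set membership deduplicates repeats
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)          # fewer distinct values than k: exact
    return int((k - 1) / smallest[-1])

users = ["user%d" % i for i in range(10000)]
estimate = kmv_estimate(users, k=256)
# estimate lands close to the true cardinality of 10,000
```

The sketch occupies k values no matter how many distinct users exist, which is exactly the storage/accuracy trade-off the slide describes: fast approximate distinct counts, but no exact counts or per-row filtering.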
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved Indexing Data
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved Indexing Service • Indexing is performed by • Overlord • Middle Managers • Peons • Middle Managers spawn peons which runs ingestion tasks • Each peon runs 1 task • Task definition defines which task to run and its properties
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved 2 Streaming Ingestion : Realtime Index Tasks • Ability to ingest streams of data • Stores data in write-optimized structure • Periodically converts write-optimized structure to read-optimized segments • Event query-able as soon as it is ingested • Both push and pull based ingestion
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved Streaming Ingestion : Tranquility • Helper library for coordinating streaming ingestion • Simple API to send events to Druid • Transparently Manages • Realtime Index Task Creation • Partitioning and Replication • Schema Evolution • Can be used with your favourite ETL framework e.g. Flink, NiFi, Samza, Spark, Storm • At-least-once ingestion
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved Kafka Indexing Service (experimental) • Supports Exactly once ingestion • Messages pulled by Kafka Index Tasks • Each Kafka Index Task consumes from a set of partitions with specific start and end offset • Each message verified to ensure sequence • Kafka Offsets and corresponding segments persisted in same metadata transaction atomically • Kafka Supervisor • embedded inside overlord • Manages kafka index tasks • Retry failed tasks Task 1 Task 2 Task 3
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved Batch Ingestion • HadoopIndexTask • Peon launches Hadoop MR job • Mappers read data • Reducers create Druid segment files • Index Task • Runs in a single JVM i.e. a peon • Suitable for small data sizes (< 1 GB) • Integrations with Apache Hive and Spark for Batch Ingestion
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved Querying Data
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved Querying Data from Druid • Druid supports • JSON Queries over HTTP • In built SQL (experimental) • Querying libraries available for • Python • R • Ruby • Javascript • Clojure • PHP • Multiple Open source UI tools
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved 2 JSON Over HTTP • HTTP Rest API • Queries and results expressed in JSON • Multiple Query Types • Time Boundary • Timeseries • TopN • GroupBy • Select • Segment Metadata
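A TopN query illustrates the JSON-over-HTTP query shape. The field names (queryType, dataSource, dimension, metric, threshold, aggregations, intervals) follow Druid's JSON query API; the datasource name, column names, and broker address are illustrative:

```python
import json

# TopN: top 5 pages by total rows added, over one day, across the whole interval.
topn_query = {
    "queryType": "topN",
    "dataSource": "wikipedia",          # illustrative datasource name
    "dimension": "page",
    "metric": "sum_added",
    "threshold": 5,
    "granularity": "all",
    "aggregations": [
        {"type": "longSum", "name": "sum_added", "fieldName": "added"}
    ],
    "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"],
}
body = json.dumps(topn_query)
# POST the body to a broker, e.g. http://<broker-host>:8082/druid/v2/
# with Content-Type: application/json; results come back as JSON.
```

TopN is Druid's fast, approximate alternative to a GroupBy-plus-sort over a single dimension.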
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved In built SQL (experimental) • Apache Calcite based parser and planner • Ability to connect druid to any BI tool that supports JDBC • SQL via JSON over HTTP • Supports Approximate queries • APPROX_COUNT_DISTINCT(col) • Ability to do Fast Approx TopN queries • APPROX_QUANTILE(column, probability)
  • 31. 31 © Hortonworks Inc. 2011–2018. All rights reserved Integrated with multiple Open Source UI tools • Superset – • Developed at AirBnb • In Apache Incubation since May 2017 • Grafana – Druid plugin • Metabase • With in-built SQL, connect with any BI tool supporting JDBC
  • 32. 32 © Hortonworks Inc. 2011–2018. All rights reserved Druid in Production
  • 33. 33 © Hortonworks Inc. 2011–2018. All rights reserved Druid in Production  Is Druid suitable for my Use case ?  Will Druid meet my performance requirements at scale ?  How complex is it to Operate and Manage Druid cluster ?  How to monitor a Druid cluster ?  High Availability ?  How to upgrade Druid cluster without downtime ?  Security ?
  • 34. 34 © Hortonworks Inc. 2011–2018. All rights reserved Suitable Use Cases • Powering Interactive user facing applications • Arbitrary slicing and dicing of large datasets • User behavior analysis • measuring distinct counts • retention analysis • funnel analysis • A/B testing • Exploratory analytics/root cause analysis • Not interested in dumping entire dataset
  • 35. 35 © Hortonworks Inc. 2011–2018. All rights reserved Performance and Scalability : Fast Facts Most Events per Day 300 Billion Events / Day (Metamarkets) Most Computed Metrics 1 Billion Metrics / Min (Jolata) Largest Cluster 200 Nodes (Metamarkets) Largest Hourly Ingestion 2TB per Hour (Netflix)
  • 36. 36 © Hortonworks Inc. 2011–2018. All rights reserved Performance Numbers • Query Latency • average - 500ms • 90%ile < 1 sec • 95%ile < 5 sec • 99%ile < 10 sec • Query Volume • 1000s of queries per minute • Benchmarking code • https://github.com/druid-io/druid-benchmark
  • 37. 37 © Hortonworks Inc. 2011–2018. All rights reserved Simplified Druid Cluster Management with Ambari  Install, configure and manage Druid and all external dependencies from Ambari  Easy to enable HA, Security, Monitoring …
  • 38. 38 © Hortonworks Inc. 2011–2018. All rights reserved Simplified Druid Cluster Management with Ambari
  • 39. 39 © Hortonworks Inc. 2011–2018. All rights reserved Monitoring a Druid Cluster • Each Druid Node emits metrics for • Query performance • Ingestion Rate • JVM Health • Query Cache performance • System health • Emitted as JSON objects to a runtime log file or over HTTP to other services • Emitters available for Ambari Metrics Server, Graphite, StatsD, Kafka • Easy to implement your own metrics emitter
  • 40. 40 © Hortonworks Inc. 2011–2018. All rights reserved Monitoring using Ambari Metrics Server • HDP 2.6.1 contains pre-defined grafana dashboards • Health of Druid Nodes • Ingestion • Query performance • Easy to create new dashboards and setup alerts • Auto configured when both Druid and Ambari Metrics Server are installed
  • 41. 41 © Hortonworks Inc. 2011–2018. All rights reserved Monitoring using Ambari Metrics Server
  • 42. 42 © Hortonworks Inc. 2011–2018. All rights reserved Monitoring using Ambari Metrics Server
  • 43. 43 © Hortonworks Inc. 2011–2018. All rights reserved High Availability • Deploy Coordinator/Overlord on multiple instances • Leader election in zookeeper • Broker – install multiple brokers • Use druid Router/ Any Load balancer to route queries to brokers • Realtime Index Tasks – create redundant tasks. • Historical Nodes – create load rule with replication factor >= 2 (default = 2)
  • 44. 44 © Hortonworks Inc. 2011–2018. All rights reserved Rolling Upgrades • Shared-Nothing Architecture • Maintain backwards compatibility • Data redundancy • Upgrade one Druid component at a time • No downtime
  • 45. 45 © Hortonworks Inc. 2011–2018. All rights reserved Security • Supports authentication via Kerberos/SPNEGO • Easy wizard-based Kerberos security enablement via Ambari
  • 46. 46 © Hortonworks Inc. 2011–2018. All rights reserved Summary • Easy installation and management via Ambari • Real-time • Ingestion latency < seconds • Query latency < seconds • Arbitrary slice and dice of big data like a ninja • No more pre-canned drill-downs • Query with more fine-grained granularity • High availability and rolling deployment capabilities • Secure and production ready • Vibrant and active community • Available as Tech Preview in HDP 2.6.1
  • 47. 47 © Hortonworks Inc. 2011–2018. All rights reserved Useful Resources • Druid website – http://druid.io • Druid User Group – users@druid.incubator.apache.org • Druid Dev Group – dev@druid.incubator.apache.org
  • 48. 48 © Hortonworks Inc. 2011–2018. All rights reserved Thank you Twitter - @NishantBangarwa Email - nbangarwa@hortonworks.com
  • 49. 49 © Hortonworks Inc. 2011–2018. All rights reserved Questions?
  • 50. 50 © Hortonworks Inc. 2011–2018. All rights reserved Extending Core Druid • Plugin-based architecture • Leverages Guice to load extensions at runtime • Possible extensions: • Add a new deep storage implementation • Add a new Firehose for ingestion • Add aggregators • Add complex metrics • Add new query types • Add new Jersey resources • Bundle your extension with all the other Druid extensions
  • 51. 51 © Hortonworks Inc. 2011–2018. All rights reserved Performance: Approximate Algorithms • Ability to store approximate data sketches for high-cardinality columns, e.g. userid • Reduced storage size • Use cases • Fast approximate distinct counts • Approximate top-K queries • Approximate histograms • Funnel/retention analysis • Limitations • Not possible to do exact counts • Not possible to filter on individual row values
  • 52. 52 © Hortonworks Inc. 2011–2018. All rights reserved Superset • Python backend • Flask app builder • Authentication • Pandas for rich analytics • SqlAlchemy for SQL toolkit • Javascript frontend • React, NVD3 • Deep integration with Druid
  • 53. 53 © Hortonworks Inc. 2011–2018. All rights reserved Superset Rich Dashboarding Capabilities: Treemaps
  • 54. 54 © Hortonworks Inc. 2011–2018. All rights reserved Superset Rich Dashboarding Capabilities: Sunburst
  • 55. 55 © Hortonworks Inc. 2011–2018. All rights reserved Superset UI Provides Powerful Visualizations Rich library of dashboard visualizations: Basic: • Bar Charts • Pie Charts • Line Charts Advanced: • Sankey Diagrams • Treemaps • Sunburst • Heatmaps And More!

Editor's Notes

  1. Motivation, Druid introduction and use case, Demo, Druid Architecture, Storage Internals, Recent Improvements
  2. Initial Use Case Power the ad-tech analytics product at Metamarkets. Similar to the picture on the right: a dashboard where you can visualize timeseries data and do arbitrary filtering and grouping on any combination of dimensions. Requirements - The data store needs to support arbitrary queries, i.e. users should be able to filter and group on any combination of dimensions. Scalability: should be able to handle trillions of events/day. Interactive: since the data store was going to power an interactive dashboard, low latency queries were a must. Real-time: the time between when an event occurs and when it is visible on the dashboard should be minimal (on the order of a few seconds). High Availability – no central point of failure. Rolling Upgrades – the architecture was required to support rolling upgrades.
  3. MOTIVATION Interactive real-time visualizations on complex data streams. Answer BI questions: How many unique male visitors visited my website last month? How many products were sold last quarter, broken down by demographic and product category? Not interested in dumping the entire dataset. Suppose I am running an ad campaign and I want to understand what kind of impressions there are, what my click-through rate is, and how many users decided to purchase my services. We have a user activity stream and we may want to know how the users are behaving. We may have a stream of firewall events and want to detect anomalies in those streams in real time. Also, for very large distributed clusters there is a need to answer questions about application performance: How is each individual node in my cluster behaving? Are there any anomalies in query response time? All of the above use cases can have data streams which are huge in volume, depending on the scale of the business. How do I analyze this information? How do I get insights from these streams of events in real time?
  4. What is Druid? Column-oriented distributed datastore – data is stored in columnar format. In general, many datasets have a large number of dimensions, e.g. 100s or 1000s, but most queries only need 5-10 columns; the column-oriented format lets Druid scan only the required columns. Sub-second query times – it uses techniques like bitmap indexes for fast filtering of data, memory-mapped files to serve data from memory, data summarization and compression, and query caching, and has highly optimized algorithms for different query types, achieving sub-second query times. Realtime streaming ingestion from almost any ETL pipeline. Arbitrary slicing and dicing of data – no need to create pre-canned drill-downs. Automatic data summarization – Druid can summarize your data at ingestion time, e.g. if my dashboard only shows events aggregated by HOUR, we can optionally configure Druid to pre-aggregate at ingestion time. Approximate algorithms (HyperLogLog, theta sketches) – for fast approximate answers. Scalable to petabytes of data. Highly available.
  5. This shows some of the production users; I can talk about some of the large ones, which have common use cases. Alibaba and eBay use Druid for ecommerce and user-behavior analytics. Cisco has a realtime analytics product for analyzing network flows. Yahoo uses Druid for user-behavior analytics and realtime cluster monitoring. Hulu does interactive analysis of user and application behavior. PayPal and SK Telecom use Druid for business analytics.
  6. Realtime Nodes - Handle real-time ingestion; support both pull- and push-based ingestion. Store data in a row-oriented, write-optimized structure. Periodically convert the write-optimized structure to a read-optimized structure. Ability to serve queries as soon as data is ingested. Historical Nodes - Main workhorses of a Druid cluster. Use memory-mapped files to load columnar data. Respond to user queries. Broker Nodes - Keep track of which node is serving which portion of the data. Ability to scatter a query across multiple historical and realtime nodes. Caching layer.
  7. Druid has the concept of different node types, where each node is designed and optimized to perform a specific set of tasks. Realtime Index Tasks / Realtime Nodes - Handle real-time ingestion; support both pull- and push-based ingestion. Handle queries - ability to serve queries as soon as data is ingested. Store data in a write-optimized structure on heap, periodically convert it to read-optimized, time-partitioned immutable segments, and persist them to deep storage. In case you need to do any ETL, like data enrichment or joining multiple streams of data, you can do it in a separate ETL pipeline and send your massaged data to Druid. Deep storage can be any distributed FS and acts as a permanent backup of the data. Historical Nodes - Main workhorses of the Druid cluster. Use memory-mapped files to load immutable segments. Respond to user queries. Now let's see how data can be queried. Broker Nodes - Keep track of the data chunks being loaded by each node in the cluster. Ability to scatter a query across multiple historical and realtime nodes. Caching layer. Now let's discuss another case: when you do not have streaming data but want to ingest batch data into Druid. Batch ingestion can be done using either a Hadoop MR or Spark job, which converts your data into time-partitioned segments and persists them to deep storage.
  8. With many historical nodes in a cluster there is a need to balance the load across them; this is done by the Coordinator Nodes - Use ZooKeeper for coordination. Ask historical nodes to load or drop data. They also move data across historical nodes to balance load in the cluster. Manage data replication. External Dependencies – Metadata storage – stores metadata about the segments, i.e. the location of segments, information on how to load the segments, etc. Memcached/Redis cache – you can optionally add a memcached or Redis cache which can be used to cache partial query results.
  9. Druid: Segments Data in Druid is stored in Segment Files. Partitioned by time Ideally, segment files are each smaller than 1GB. If files are large, smaller time partitions are needed.
  10. Example Wikipedia Edit Dataset
  11. Data Rollup Rollup by hour
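The hourly rollup described in this note can be sketched as plain aggregation: truncate each timestamp to its hour bucket, then sum the metric and count the collapsed rows per (hour, dimension) key. The event tuples and column names below are illustrative, not from the slides; this is a minimal sketch of the idea, not Druid's actual ingest code.

```python
from collections import defaultdict

# Hypothetical raw edit events: (timestamp, page, chars_added).
events = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", 10),
    ("2011-01-01T00:03:45Z", "Justin Bieber", 25),
    ("2011-01-01T00:05:03Z", "Ke$ha",         15),
    ("2011-01-01T01:00:12Z", "Ke$ha",         17),
]

def rollup_by_hour(rows):
    # Truncate each timestamp to the hour and aggregate per (hour, page):
    # sum the metric and count the collapsed rows, as Druid does at ingest time.
    agg = defaultdict(lambda: [0, 0])   # (hour, page) -> [sum, row count]
    for ts, page, chars in rows:
        hour = ts[:13] + ":00:00Z"      # "2011-01-01T00:01:35Z" -> hourly bucket
        agg[(hour, page)][0] += chars
        agg[(hour, page)][1] += 1
    return {k: tuple(v) for k, v in agg.items()}

rolled = rollup_by_hour(events)
# ("2011-01-01T00:00:00Z", "Justin Bieber") -> (35, 2): two raw rows became one.
print(rolled)
```

Four raw rows collapse to three rolled-up rows here; on real datasets with many events per hour per dimension combination, this pre-aggregation is where the large storage savings come from.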
  12. Dictionary Encoding Create and store ids for each value, e.g. the page column: Values - Justin Bieber, Ke$ha, Selena Gomez. Encoding - Justin Bieber: 0, Ke$ha: 1, Selena Gomez: 2. Column data - [0 0 0 1 1 2]. city column - [0 0 0 1 1 1].
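The encoding in this note can be reproduced in a few lines: assign each distinct string an integer id in first-seen order and rewrite the column as ids. This is a sketch of the general technique, not Druid's internal implementation.

```python
def dictionary_encode(values):
    # Assign an integer id to each distinct value in first-seen order
    # and rewrite the column as a list of ids.
    ids = {}
    encoded = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        encoded.append(ids[v])
    return ids, encoded

page = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
        "Ke$ha", "Ke$ha", "Selena Gomez"]
dictionary, column = dictionary_encode(page)
print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1, 'Selena Gomez': 2}
print(column)      # [0, 0, 0, 1, 1, 2]
```

The column now stores small integers instead of repeated strings, which compresses well and makes the bitmap indexes on the next slide cheap to build.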
  13. Bitmap Indices Store Bitmap Indices for each value Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0] Ke$ha -> [3, 4] -> [0 0 0 1 1 0] Selena Gomes -> [5] -> [0 0 0 0 0 1] Queries Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0] language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1] Indexes compressed with Concise or Roaring encoding
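The OR/AND mechanics in this note can be sketched with Python integers as bitmaps (bit i set means row i holds the value); real Druid uses compressed Concise or Roaring bitmaps, so this is only an illustration of the boolean algebra, not the storage format.

```python
def build_bitmaps(column):
    # One bitmap per distinct value; bit i is set when row i holds that value.
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows(bitmap, n_rows):
    # Decode a bitmap back into the list of matching row numbers.
    return [i for i in range(n_rows) if (bitmap >> i) & 1]

page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomez"]
bm = build_bitmaps(page)

# page = 'Justin Bieber' OR page = 'Ke$ha' -> bitwise OR of the two bitmaps;
# an AND filter across columns would be a bitwise AND instead.
either = bm["Justin Bieber"] | bm["Ke$ha"]
print(rows(either, len(page)))  # [0, 1, 2, 3, 4]
```

The filter is resolved entirely on the index, without touching the column values, which is what makes Druid's filtered scans fast.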
  14. Data Rollup Rollup by hour
  15. The Indexing Service is a highly-available, distributed service that runs indexing-related tasks. It is composed of three main components: Overlord - responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers. Middle Managers - worker nodes that execute submitted tasks; they launch peons that actually run the tasks. Peons – managed by middle managers; each runs a single task. A peon gets a task definition, which is a JSON spec file that describes the task to perform. All the coordination and communication for task assignment and announcing task statuses is done via ZooKeeper.
  16. Streaming Ingestion Done by realtime index tasks. Ability to ingest streams of data. Stores data in a write-optimized structure – a row-oriented key-value store indexed by time and dimension values. Periodically, based on either a time interval or a threshold on the number of rows, it converts the write-optimized structure to read-optimized segments. Events are queryable as soon as they are ingested. Both push- and pull-based ingestion.
  17. Tranquility is a helper library for Druid which provides easy coordination and task management for performing streaming ingestion into Druid. It has a very simple API which you can use to send events to Druid. On the right-hand side you can see a simple example of sending an event to Druid. You just create a Tranquilizer with a config; the config contains the location of the Druid overlord, the name of your datasource, and other ingestion-related properties. Simply call send on the tranquilizer; it automatically takes care of creating a Druid task, managing the lifecycle of the task, discovering the location of the task, and sending data to that task.
  18. We have also added experimental support for ingesting data from Kafka that supports exactly-once consumption of data. How Kafka works is as follows: each message written to Kafka is placed into an ordered and immutable sequence called a partition and is assigned a sequentially incrementing identifier called an offset. Messages are pulled by Druid tasks, which verify the sequence and offsets to ensure ordering. Then, at the time of persisting the data, both the segments and the information related to the Kafka offsets are persisted in a single transaction. Since we have the offsets in the metadata, in case of failure we can start reading from that offset again.
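The single-transaction commit this note describes can be sketched with SQLite standing in for the metadata store. The schema, table names, and segment id below are invented for illustration and are not Druid's actual metadata layout; the point is only that the segment record and the consumed offsets commit atomically, so a crash can never persist one without the other.

```python
import sqlite3

# In-memory stand-in for Druid's metadata store (schema is hypothetical).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE segments (id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")

def publish(segment_id, part, next_offset):
    # 'with db:' wraps both statements in one transaction: on success both
    # rows commit; on any error both roll back. On restart, the task resumes
    # reading Kafka from the stored next_offset.
    with db:
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)", (part, next_offset))
        db.execute("INSERT INTO segments VALUES (?)", (segment_id,))

publish("wikipedia_2011-01-01/v1/0", 0, 42)

# Re-publishing the same segment fails, and the offset update rolls back too.
try:
    publish("wikipedia_2011-01-01/v1/0", 0, 99)
except sqlite3.IntegrityError:
    pass
print(db.execute("SELECT next_offset FROM offsets WHERE part = 0").fetchone()[0])  # 42
```

The duplicate-publish case shows why the atomicity matters: a replayed task cannot advance the offset without also successfully publishing its segment.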
  19. Batch Ingestion – ingest data in batches. HadoopIndexTask - Peon launches a Hadoop MR job; mappers read data; reducers create Druid segment files. Index Task - Suitable for small data sizes (<1 GB).
  20. Druid broker nodes expose HTTP endpoints where users can POST queries. Queries and results are expressed in JSON. Multiple query types. On the right we have an example of a groupBy query; in the JSON query you can specify the datasource, granularity – the time bucket you want to group your data by, any filter you may want to use, the list of aggregations to perform, and any post-aggregations like average, etc.
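A groupBy query of the shape this note describes can be sketched as follows; the datasource, column names, and broker address are illustrative (the slide's actual example is not reproduced here), while `/druid/v2/` is the broker's standard native-query endpoint. The POST itself is left commented out since it needs a running cluster.

```python
import json
from urllib import request

# Hypothetical groupBy query: bucket by hour, group on 'page',
# filter to English-language rows, and sum an ingested 'count' metric.
query = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "dimensions": ["page"],
    "filter": {"type": "selector", "dimension": "language", "value": "en"},
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
    "intervals": ["2011-01-01/2011-01-02"],
}

def post_query(broker_url, q):
    # POST the JSON body to the broker; the result set comes back as JSON.
    req = request.Request(broker_url + "/druid/v2/",
                          data=json.dumps(q).encode(),
                          headers={"Content-Type": "application/json"})
    return json.load(request.urlopen(req))

# results = post_query("http://localhost:8082", query)  # needs a live broker
print(query["queryType"])  # groupBy
```

The broker scatters this query to the historical and realtime nodes holding the relevant segments and merges their partial results before responding.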
  21. The second and easier way to query Druid is using SQL (support for built-in SQL is experimental at present). We leverage Apache Calcite for parsing and planning the query. It also uses Avatica, which is a framework for building JDBC drivers for databases, so you can connect any BI tool that supports JDBC to Druid. Druid also defines some new operators for supporting approximate queries.
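Besides JDBC via Avatica, SQL can be sent to the broker over plain HTTP as a `{"query": ...}` JSON body; the `/druid/v2/sql/` path and the table/column names below are assumptions for this sketch (the feature is experimental in the release the deck covers), and the POST is commented out since it needs a live cluster.

```python
import json
from urllib import request

# SQL equivalent of an hourly groupBy on a hypothetical 'wikipedia' datasource;
# "count" is double-quoted because it is a reserved word.
sql = (
    'SELECT FLOOR(__time TO HOUR) AS hr, page, SUM("count") AS edits '
    "FROM wikipedia WHERE language = 'en' "
    "GROUP BY FLOOR(__time TO HOUR), page"
)

def post_sql(broker_url, statement):
    # Send the SQL statement to the broker's (assumed) SQL endpoint.
    req = request.Request(broker_url + "/druid/v2/sql/",
                          data=json.dumps({"query": statement}).encode(),
                          headers={"Content-Type": "application/json"})
    return json.load(request.urlopen(req))

# rows = post_sql("http://localhost:8082", sql)  # needs a live broker
print(len(sql) > 0)
```

Calcite plans this SQL into the same native groupBy query shown on the previous slide, so both paths hit identical execution machinery.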
  22. Retention analysis
  23. Most Events per Day 300 Billion Events / Day (Metamarkets) Most Computed Metrics 1 Billion Metrics / Min (Jolata) Largest Cluster 200 Nodes (Metamarkets) Largest Hourly Ingestion 2TB per Hour (Netflix)
  24. Query Latency average - 500ms 90%ile < 1sec 95%ile < 5sec 99%ile < 10 sec Query Volume 1000s queries per minute
  25. Query performance – query time, segment scan time … Ingestion Rate – events ingested, events persisted … JVM Health – JVM Heap usage, GC stats … Cache Related – cache hits, cache misses, cache evictions … System related – cpu, disk, network, swap usage etc..
  26. No Downtime Data redundancy Rolling upgrades
  27. You can secure Druid nodes using Kerberos, and use SPNEGO mechanism to interact with druid HTTP end points.
  28. Summary It is easy to install and manage druid via Ambari Realtime with ingestion and query latency of the order of few secs. Arbitrary slicing and dicing of data
  30. Guice, which is a lightweight dependency injection framework.