Apache Druid 101: Fast, Real-time, Open Source Analytics
Matt Sarrel
Developer Evangelist
● Former IT leadership/CIO roles
● Focus on data management, data analysis, network infrastructure, and network security
● 15 years of startup experience (mostly open source infrastructure and datastores)
● Former PCMag Tech Director, GigaOm Pro analyst, eWeek and InfoWorld contributor
● BA (History) and MPH (ID Epi)
● CISSP certified
● Competitive BBQ cook (KCBS)
matt.sarrel@imply.io
@msarrel on Twitter
@matt on ASF
#druid Slack
Agenda
● What is Druid?
● Why was Druid created?
● Best uses for Druid
● How Druid works
Druid is a high-performance, real-time analytics database
Apache Druid Powers Interactive Applications
1st gen: on-prem data warehouses
The 1st gen architecture was unscalable, complex, and expensive.
[Diagram] Data sources → ETL (processing) → data warehouse (store and compute) → BI tools, reporting, analytics
2nd gen: cloud data warehouses
The 2nd gen, while cheaper and more flexible, still has many latency restrictions.
[Diagram] Data sources → data lake (S3, blob store, etc.) → ELT (Spark, Hadoop, etc.) → data warehouse → BI tools, reporting, analytics
3rd gen: Apache Druid/Imply
The 3rd gen architecture is designed for an increasingly low latency world.
[Diagram] Data sources → message bus (Kafka, Kinesis, Pub/Sub) → ELT (Spark Streaming, Kafka Streams, Apache Flink) → Druid → next-gen data apps, interactive GUIs, real-time analytics; a data warehouse remains alongside Druid for archiving and reporting
Metamarkets: The First Use Case
Druid was created at a startup called Metamarkets (now part of Snapchat)
Druid was created to power an interactive app for digital advertisers
Advertisers loaded impressions and clicks data
Advertisers used the app to optimize user/ad engagement
Druid has since expanded to many new verticals and use cases
Challenges
Data:
• Scale: millions of events/sec (batch and real-time)
• Complexity: high dimensionality & high cardinality
• Structure: semi-structured (nested, evolving schemas, etc.)
App:
• Drill downs: static reports aren't enough (BI tools aren't enough)
• Multi-tenancy: thousands of concurrent users
• Self-service: many users are non-technical
Core Design
Druid combines ideas from search platforms, time series databases, and OLAP engines:
● Real-time ingestion
● Flexible schema
● Full text search
● Batch ingestion
● Efficient storage
● Fast analytic queries
● Optimized storage for time-based datasets
● Time-based functions
Key features
Column oriented
High concurrency
Scalable to 1000s of servers, millions of messages/sec
Continuous, real-time ingest
Query through SQL via API (see the sketch below)
Target query latency sub-second to a few seconds
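A minimal sketch of the SQL-over-HTTP path, assuming a local cluster behind the default router port (8888) and a hypothetical "wikipedia" datasource:

# Sketch: issue a Druid SQL query over HTTP.
# Host, port, and the "wikipedia" datasource are assumptions; adjust to your cluster.
import requests

sql = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 10
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": sql}, timeout=30)
resp.raise_for_status()
for row in resp.json():          # Druid returns a JSON array of row objects
    print(row["channel"], row["edits"])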
Druid in Production
User activity
Data sets: clickstreams, view streams, activity streams
Group users along any attributes, without pre-computation or pre-definition
Compare groups of users against each other
Define interesting groupings quickly through top lists
Count number of users matching any criteria
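As a hedged sketch of the "count users matching any criteria" pattern, assuming a hypothetical user_activity datasource with user_id, country, and event_type columns:

# Sketch: approximate distinct users per attribute, with an arbitrary filter.
# Datasource and column names are hypothetical; APPROX_COUNT_DISTINCT trades
# a small error for speed at high cardinality.
import requests

sql = """
SELECT country, APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM user_activity
WHERE event_type = 'purchase'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY country
ORDER BY unique_users DESC
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": sql}, timeout=30)
print(resp.json())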
Network flows
Data set: netflow logs
View relationships between source & dest addresses
Measure flows based on protocol, interface, IP address, or any other attribute
Burstable billing: 95th percentile flow rates in 5 min buckets
Troubleshoot bottlenecks
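A sketch of the burstable-billing calculation above, assuming a hypothetical netflow datasource with a bytes column (APPROX_QUANTILE_DS requires the druid-datasketches extension):

# Sketch: 95th percentile of per-5-minute traffic over the last 30 days.
import requests

sql = """
SELECT APPROX_QUANTILE_DS(bucket_bytes, 0.95) AS p95_bytes_per_5m
FROM (
  SELECT TIME_FLOOR(__time, 'PT5M') AS bucket, SUM(bytes) AS bucket_bytes
  FROM netflow
  WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
  GROUP BY 1
)
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": sql}, timeout=60)
print(resp.json())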
Digital advertising
Data sets: bids, clicks, impressions, etc.
Analyze campaign performance on ad-hoc groupings of participants
Compute quantiles and histograms for bid prices
Calculate conversion rates (impressions → clicks)
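A sketch of the conversion-rate query, assuming a hypothetical ad_events datasource with campaign_id and event_type columns:

# Sketch: impression -> click conversion rate per campaign over the last day.
import requests

sql = """
SELECT campaign_id,
       SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) * 1.0
         / SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END)
         AS conversion_rate
FROM ad_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY campaign_id
ORDER BY conversion_rate DESC
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": sql}, timeout=30)
print(resp.json())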
Server metrics
Data sets: server logs, application metrics, etc.
Track CPU load on servers, numbers of cache requests/hits/misses, data center
performance, etc.
Aggregate time series on the fly
Compute latency %iles over ad hoc groups of events (all ‘foo’ servers; all ‘/v1/bar’ API
calls; all servers in rack 10; etc)
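A sketch of per-group latency percentiles, assuming a hypothetical app_metrics datasource with api_path, server, and latency_ms columns:

# Sketch: p50/p99 latency for one API path, broken down by server.
import requests

sql = """
SELECT server,
       APPROX_QUANTILE_DS(latency_ms, 0.50) AS p50_ms,
       APPROX_QUANTILE_DS(latency_ms, 0.99) AS p99_ms
FROM app_metrics
WHERE api_path = '/v1/bar'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY server
ORDER BY p99_ms DESC
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": sql}, timeout=30)
print(resp.json())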
Druid…it’s out there
The original Druid cluster:
• >500 TB of segments (>50 trillion raw events, >50 PB raw data)
• mean 500ms query time
• 90%ile < 1s
• 95%ile < 5s
• 99%ile < 10s
Netflix Druid cluster:
• 100 billion+ rows/day
• 1+ trillion rows, retained for at least a year
• 100s of servers
• Sub-second to a few seconds query response
• Relies on a combination of streaming and batch ingestion
Druid is designed for performance
Data sourced from: Correia, José & Costa, Carlos & Santos, Maribel. (2019). Challenging SQL-on-Hadoop Performance with Apache Druid.
There is no such thing as too fast
Is Druid Right For My Project?
Data characteristics:
• Timestamp dimension
• Streaming
• Denormalized
• Many attributes (30+ dimensions)
• High cardinality
Use case characteristics:
• Large dataset
• Fast query response (<1s)
• Low latency data ingestion
• Interactive, ad-hoc queries
• Arbitrary slicing and dicing (OLAP)
• Query real-time & historical data
• Infrequent updates
Druid in the Data Pipeline
[Diagram] Raw data (clicks, ad impressions; network telemetry; application events) → staging and processing (data lakes, message buses) → analytics database (Druid) → end-user application
Druid and Data Warehouses
Druid is not a data warehouse (DW)
Druid augments a DW to provide:
• consistent, sub-second SLA
• pre-aggregation/metrics generation upon ingest
• simple schema
• high concurrency reads
Druid is for hot queries (sub-second queries on fresh data)
• Slice and dice OLAP
• Dashboards that fire dozens of queries at once
DW is for cold queries (second+ queries on historical data)
Druid Architecture
Architecture (Ingestion)
[Diagram] Files (batch) and streams (real-time) → indexers → segments → historicals
The Ingestion Spec
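The spec itself is JSON. As a hedged sketch, a minimal Kafka supervisor spec for a hypothetical clicks topic, submitted to the supervisor endpoint (datasource, topic, broker address, and columns are assumptions):

# Sketch: submit a streaming (Kafka) ingestion spec to Druid.
# The dataSchema / ioConfig / tuningConfig shape follows Druid's supervisor
# spec format; all names and addresses here are hypothetical.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clicks",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "url", "country"]},
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
            },
        },
        "ioConfig": {
            "topic": "clicks",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor",
                     json=supervisor_spec, timeout=30)
print(resp.json())  # {"id": "clicks"} on success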
Druid segments
Segments enable a global index and write-once consistency.
Engine and data format are tightly integrated
● Secondary indexes
● Compression
● Operate on compressed data
● Late materialization
[Diagram] Example string column: values are dictionary-encoded (DICT: Melbourne = 0, Perth = 1, Sydney = 2), the column stores the dictionary IDs (0, 0, 0, 1, 1, 2, 2, 2), and a bitmap index maps each value to the rows containing it (Melbourne → rows 0-2 = 11100000, Perth → rows 3-4 = 00011000, Sydney → rows 5-7 = 00000111).
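To make the encoding concrete, a toy illustration (plain Python, not Druid internals) of dictionary encoding plus per-value bitmaps for the city column above:

# Toy sketch: dictionary-encode a string column and build a bitmap index
# that maps each distinct value to the rows containing it.
cities = ["Melbourne", "Melbourne", "Melbourne", "Perth", "Perth",
          "Sydney", "Sydney", "Sydney"]

dictionary = {v: i for i, v in enumerate(sorted(set(cities)))}  # Melbourne=0, Perth=1, Sydney=2
encoded = [dictionary[v] for v in cities]                       # [0, 0, 0, 1, 1, 2, 2, 2]

bitmaps = {value: "".join("1" if c == value else "0" for c in cities)
           for value in dictionary}
# {'Melbourne': '11100000', 'Perth': '00011000', 'Sydney': '00000111'}

# A filter like WHERE city = 'Sydney' only needs the bitmap, not the raw strings.
sydney_rows = [i for i, bit in enumerate(bitmaps["Sydney"]) if bit == "1"]
print(encoded, bitmaps, sydney_rows)

In Druid the bitmaps are compressed (Roaring or CONCISE), which is what allows filters to operate directly on compressed data.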
Querying
Query libraries:
• JSON over HTTP (native queries; see the sketch below)
• SQL
• R
• Python
• Ruby
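A hedged sketch of the "JSON over HTTP" (native query) path, assuming the same local router and a hypothetical wikipedia datasource with an "added" column:

# Sketch: a native timeseries query posted to the /druid/v2 endpoint.
import requests

native_query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "added", "fieldName": "added"},
    ],
}

resp = requests.post("http://localhost:8888/druid/v2",
                     json=native_query, timeout=30)
print(resp.json())  # list of {"timestamp": ..., "result": {...}} buckets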
Query Processing in Parallel
[Diagram] A broker (query server) fans the query out to historicals and indexers (data servers); each scans its local segments in parallel and the broker merges the partial results.
Apache Druid Unified Console
Resources
druid.apache.org
druid.apache.org/community
Imply Meetup Groups https://www.meetup.com/pro/apache-druid
ASF #druid Slack channel
@druidio on Twitter
Apache Distribution https://github.com/apache/druid
Imply Distribution https://imply.io/get-started
