Data Con LA 2020
Description
Apache Druid is a cloud-native open-source database that enables developers to build highly scalable, low-latency, real-time interactive dashboards and apps to explore huge quantities of data. This column-oriented database provides the sub-second query response times required for ad-hoc queries and programmatic analytics. Druid natively streams data from Apache Kafka (and more) and batch loads just about anything. At ingestion, Druid partitions data based on time, so time-based queries run significantly faster than in traditional databases, and Druid offers SQL compatibility. Druid is used in production by Airbnb, Nielsen, Netflix and more for real-time and historical data analytics. This talk provides an introduction to Apache Druid, including: Druid's core architecture and its advantages; working with streaming and batch data in Druid; querying data and building apps on Druid; and real-world examples of Apache Druid in action.
Speaker
Matt Sarrel, Imply Data, Developer Evangelist
2. Matt Sarrel
Developer Evangelist
● Former IT leadership/CIO roles
● Focus on data management, data analysis, network infrastructure and network security
● 15 years of startup experience (mostly open-source infrastructure and datastores)
● Former PCMag Tech Director, GigaOm Pro analyst, eWeek and InfoWorld contributor
● BA (History) and MPH (ID Epi)
● CISSP certified
● Cooks competitive BBQ (KCBS)
matt.sarrel@imply.io
@msarrel on Twitter
@matt on ASF
#druid Slack
6. 1st gen: on-prem data warehouses
The 1st gen architecture was unscalable, complex, and expensive.
[Diagram: data sources feed an ETL process into a combined store-and-compute data warehouse, which serves BI tools, reporting, and analytics]
7. 2nd gen: cloud data warehouses
The 2nd gen, while cheaper and more flexible, still has many latency restrictions.
[Diagram: data sources land in a data lake (S3, blob store, etc.) for storage; ELT (Spark, Hadoop, etc.) handles processing into a data warehouse for compute, which serves BI tools, reporting, and analytics]
8. 3rd gen: Apache Druid/Imply
The 3rd gen architecture is designed for an increasingly low-latency world.
[Diagram: data sources stream through a message bus (Kafka, Kinesis, Pub/Sub); ELT (Spark Streaming, Kafka Streams, Apache Flink) handles processing into Druid, which powers next-gen data apps, interactive GUIs, and real-time analytics; a data warehouse remains alongside for archiving and reporting]
9. Metamarkets: The first use case
Druid was created at a startup called Metamarkets (now part of Snap Inc.)
Druid was created to power an interactive app for digital advertisers
Advertisers loaded impressions and clicks data
Advertisers used the app to optimize user/ad engagement
Druid has since expanded to many new verticals and use cases
10. Challenges
Data:
• Scale: millions of events/sec (batch and real-time)
• Complexity: high dimensionality & high cardinality
• Structure: semi-structured (nested, evolving schemas, etc.)
App:
• Drill-downs: static reports aren’t enough (BI tools not enough)
• Multi-tenancy: thousands of concurrent users
• Self-service: many users are non-technical
11. Core Design
● Real-time ingestion
● Flexible schema
● Full text search
● Batch ingestion
● Efficient storage
● Fast analytic queries
● Optimized storage for time-based datasets
● Time-based functions
Druid combines elements of a SEARCH PLATFORM, a TIME SERIES DB, and OLAP.
12. Key features
Column oriented
High concurrency
Scalable to 1000s of servers, millions of messages/sec
Continuous, real-time ingest
Query through SQL via API
Target query latency sub-second to a few seconds
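As a sketch of the "SQL via API" point above: Druid's broker exposes a SQL endpoint (`POST /druid/v2/sql`) that accepts a JSON body with the query string. The broker URL and the `wikipedia` datasource below are placeholder assumptions, not part of this talk.

```python
import json

# Placeholder broker address; adjust for your cluster.
BROKER_URL = "http://localhost:8888/druid/v2/sql"

sql = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY channel
ORDER BY edits DESC
LIMIT 5
"""

# Druid's SQL API takes a JSON body of the form {"query": "..."}.
payload = json.dumps({"query": sql})

# To actually execute (requires a running broker and the `requests` package):
# import requests
# rows = requests.post(BROKER_URL, data=payload,
#                      headers={"Content-Type": "application/json"}).json()
```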
14. User activity
Data sets: clickstreams, view streams, activity streams
Group users along any attributes, without pre-computation or pre-definition
Compare groups of users against each other
Define interesting groupings quickly through top lists
Count number of users matching any criteria
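The points above can be illustrated with a toy sketch: group users along an attribute chosen at query time (no pre-computation) and count distinct users per group. The events and attribute names are made up for illustration; in Druid this would be a GROUP BY with an (approximate) distinct count.

```python
from collections import defaultdict

# Hypothetical clickstream events; in Druid each would be an ingested row.
events = [
    {"user": "u1", "country": "US", "device": "mobile"},
    {"user": "u2", "country": "US", "device": "desktop"},
    {"user": "u1", "country": "US", "device": "mobile"},
    {"user": "u3", "country": "AU", "device": "mobile"},
]

def distinct_users_by(events, attribute):
    """Count distinct users per value of any attribute, chosen at query time."""
    groups = defaultdict(set)
    for e in events:
        groups[e[attribute]].add(e["user"])
    return {value: len(users) for value, users in groups.items()}

print(distinct_users_by(events, "country"))  # → {'US': 2, 'AU': 1}
print(distinct_users_by(events, "device"))   # → {'mobile': 2, 'desktop': 1}
```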
15. Network flows
Data set: netflow logs
View relationships between source & destination addresses
Measure flows based on protocol, interface, IP address, or any other attribute
Burstable billing: 95th percentile flow rates in 5 min buckets
Troubleshoot bottlenecks
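The burstable-billing calculation above can be sketched in a few lines: bucket flow samples into 5-minute windows and take the 95th percentile within each window (nearest-rank method; the sample data is made up).

```python
import math
from collections import defaultdict

def p95_by_bucket(flows, bucket_seconds=300):
    """Group (timestamp, bytes) samples into fixed time buckets and take
    the 95th-percentile value within each bucket (nearest-rank method)."""
    buckets = defaultdict(list)
    for ts, value in flows:
        buckets[ts - ts % bucket_seconds].append(value)
    result = {}
    for start, values in sorted(buckets.items()):
        values.sort()
        rank = math.ceil(0.95 * len(values))  # nearest-rank percentile
        result[start] = values[rank - 1]
    return result

# Hypothetical netflow samples: (unix timestamp, bytes transferred)
flows = [(0, 10), (60, 50), (120, 30), (310, 70), (350, 90)]
print(p95_by_bucket(flows))  # → {0: 50, 300: 90}
```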
16. Digital advertising
Data sets: bids, clicks, impressions, etc.
Analyze campaign performance on ad-hoc groupings of participants
Compute quantiles and histograms for bid prices
Calculate conversion rates (impressions → clicks)
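The conversion-rate calculation above (impressions → clicks) amounts to clicks divided by impressions, grouped by campaign; a minimal sketch with made-up event data:

```python
from collections import Counter

# Hypothetical ad events: (campaign, event_type)
events = [
    ("cmp-1", "impression"), ("cmp-1", "impression"), ("cmp-1", "click"),
    ("cmp-2", "impression"), ("cmp-2", "impression"),
    ("cmp-1", "impression"), ("cmp-1", "impression"), ("cmp-2", "click"),
]

def conversion_rates(events):
    """Clicks divided by impressions, grouped by campaign."""
    impressions, clicks = Counter(), Counter()
    for campaign, kind in events:
        (clicks if kind == "click" else impressions)[campaign] += 1
    return {c: clicks[c] / impressions[c] for c in impressions}

print(conversion_rates(events))  # → {'cmp-1': 0.25, 'cmp-2': 0.5}
```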
17. Server metrics
Data sets: server logs, application metrics, etc.
Track CPU load on servers, numbers of cache requests/hits/misses, data center performance, etc.
Aggregate time series on the fly
Compute latency %iles over ad hoc groups of events (all ‘foo’ servers; all ‘/v1/bar’ API calls; all servers in rack 10; etc.)
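"Aggregate time series on the fly" over an ad hoc group can be sketched as below: the group ("all 'foo' servers") is defined at query time by a name prefix, and CPU samples are averaged per 60-second bucket. Server names and readings are made up.

```python
from collections import defaultdict

# Hypothetical metrics: (unix timestamp, server, cpu_pct)
metrics = [
    (0, "foo-1", 40.0), (30, "foo-2", 60.0),
    (60, "foo-1", 80.0), (70, "bar-1", 10.0),
]

def avg_cpu_per_minute(metrics, server_prefix):
    """Average CPU per 60s bucket over an ad hoc group of servers,
    chosen at query time by name prefix (e.g. all 'foo' servers)."""
    buckets = defaultdict(list)
    for ts, server, cpu in metrics:
        if server.startswith(server_prefix):
            buckets[ts - ts % 60].append(cpu)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

print(avg_cpu_per_minute(metrics, "foo"))  # → {0: 50.0, 60: 80.0}
```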
18. Druid…it’s out there
The original Druid cluster:
• >500 TB of segments (>50 trillion raw events, >50 PB raw data)
• mean 500ms query time
• 90%ile < 1s
• 95%ile < 5s
• 99%ile < 10s
Netflix Druid cluster:
• 100 billion+ rows/day
• 1+ trillion rows, retained for at least a year
• 100s of servers
• Sub-second to a few seconds query response
• Relies on a combination of streaming and batch ingestion
19. Druid is designed for performance
Data sourced from: Correia, José & Costa, Carlos & Santos, Maribel. (2019). Challenging SQL-on-Hadoop Performance with Apache Druid.
21. Is Druid Right For My Project?
Data characteristics:
• Timestamp dimension
• Streaming
• Denormalized
• Many attributes (30+ dimensions)
• High cardinality
Use case characteristics:
• Large dataset
• Fast query response (<1s)
• Low latency data ingestion
• Interactive, ad-hoc queries
• Arbitrary slicing and dicing (OLAP)
• Query real-time & historical data
• Infrequent updates
22. Druid in Data Pipeline
[Diagram: raw data (clicks and ad impressions, network telemetry, application events) flows through staging and processing (data lakes, message buses) into Druid as the analytics database, which serves the end-user application]
23. Druid and Data Warehouses
Druid is not a data warehouse (DW)
Druid augments a DW to provide
• consistent, sub-second SLA
• pre-aggregation/metrics generation upon ingest
• simple schema
• high concurrency reads
Druid is for hot queries (sub-second queries on fresh data)
• Slice and dice OLAP
• Dashboards that fire dozens of queries at once
DW is for cold queries (second+ queries on historical data)
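The "pre-aggregation/metrics generation upon ingest" point above is Druid's rollup: raw rows are collapsed at ingest into one row per (time bucket, dimension) with summed metrics. A minimal sketch with made-up events:

```python
from collections import defaultdict

# Hypothetical raw events: (timestamp, page, latency_ms)
raw = [
    (12, "/home", 100), (47, "/home", 300),
    (61, "/cart", 200), (95, "/cart", 400),
]

def rollup(events, granularity=60):
    """Pre-aggregate at ingest: collapse raw rows into one row per
    (time bucket, dimension) with summed metrics, as Druid rollup does."""
    agg = defaultdict(lambda: {"count": 0, "latency_sum": 0})
    for ts, page, latency in events:
        key = (ts - ts % granularity, page)
        agg[key]["count"] += 1
        agg[key]["latency_sum"] += latency
    return dict(agg)

segments = rollup(raw)
print(segments[(0, "/home")])   # → {'count': 2, 'latency_sum': 400}
print(segments[(60, "/cart")])  # → {'count': 2, 'latency_sum': 600}
```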
28. Engine and data format are tightly integrated
Compression · secondary indexes · operate on compressed data · late materialization
Example: a string column with values [Melbourne, Melbourne, Melbourne, Perth, Perth, Sydney, Sydney, Sydney]
DICT (dictionary encoding):
Melbourne = 0
Perth = 1
Sydney = 2
DATA (encoded column): 0 0 0 1 1 2 2 2
INDEX (one bitmap per value):
Melbourne → rows [0,1,2] (11100000)
Perth → rows [3,4] (00011000)
Sydney → rows [5,6,7] (00000111)
Sydney = 2