Benchmarking Apache Druid
July 16, 2020
Matt Sarrel (matt.sarrel@imply.io)
Developer Evangelist
Agenda:
1. Intro
2. Why Benchmark?
3. Star Schema Benchmark
4. What We Did
5. DIY Druid Benchmarking
Imply Overview
Founded by the creators of Apache Druid
Funded by Tier 1 investors
Trusted by innovative enterprises
Best-in-class revenue growth
41x ARR growth in 3 years
Leading contributor to Druid
Open core
Imply’s open engine, Druid, is becoming a standard part of modern data infrastructure.
Druid
● Next generation analytics engine
● Widely adopted
Workflow transformation
● Subsecond speed unlocks new workflows
● Self-service explanations of data patterns
● Make data fun again
Core Design
● Real-time ingestion
● Flexible schema
● Full text search
● Batch ingestion
● Efficient storage
● Fast analytic queries
● Optimized storage for time-based datasets
● Time-based functions
(Combines characteristics of a search platform, a time series DB, and OLAP)
Key features
● Column oriented
● High concurrency
● Scalable to 1000s of servers, millions of messages/sec
● Continuous, real-time ingest
● Query through SQL
● Target query latency sub-second to a few seconds
Druid in Data Pipeline
(Pipeline diagram: raw data, such as clicks, ad impressions, network telemetry, and application events, flows from data lakes and message buses through staging and processing into the analytics database and on to the end-user application.)
Druid Architecture
Pick your servers
Data Nodes
● Large-ish
● Scales with size of data and query volume
● Lots of cores, lots of memory, fast NVMe disk

Query Nodes
● Medium-ish
● Scales with concurrency and # of Data nodes
● Typically CPU bound

Master Nodes
● Small-ish
● Coordinator scales with # of segments
● Overlord scales with # of supervisors and tasks
Test Configs
Data Nodes: 3 i3.2xlarge (8 CPU / 61GB RAM / 1.9TB NVMe SSD storage)
Query Nodes: 2 m5d.large (2 CPU / 8GB RAM)
Master Nodes: 1 m5.large (2 CPU / 8GB RAM)
Streaming Ingestion
Kafka
● Supervisor type: kafka
● How it works: Druid reads directly from Apache Kafka.
● Can ingest late data? Yes
● Exactly-once guarantees? Yes

Kinesis
● Supervisor type: kinesis
● How it works: Druid reads directly from Amazon Kinesis.
● Can ingest late data? Yes
● Exactly-once guarantees? Yes

Tranquility
● Supervisor type: N/A
● How it works: Tranquility, a library that ships separately from Druid, is used to push data into Druid.
● Can ingest late data? No (late data is dropped based on the windowPeriod config)
● Exactly-once guarantees? No
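As a sketch, Kafka ingestion is started by POSTing a supervisor spec to the Overlord (/druid/indexer/v1/supervisor). The datasource name, topic, columns, and broker address below are illustrative placeholders, not values from the benchmark:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "ssb_stream",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["c_nation", "s_nation", "p_category"] },
      "metricsSpec": [
        { "type": "longSum", "name": "lo_revenue", "fieldName": "lo_revenue" }
      ],
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "MINUTE" }
    },
    "ioConfig": {
      "topic": "ssb_stream",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka01:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Once submitted, the supervisor manages Kafka indexing tasks and offsets on its own, which is what enables the exactly-once guarantee above.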
Batch Ingestion
Native batch (simple)
● Parallel? No. Each task is single-threaded.
● Can append or overwrite? Yes, both.
● File formats: text file formats (CSV, TSV, JSON).
● Rollup modes: perfect if forceGuaranteedRollup = true in the tuningConfig.
● Partitioning options: hash-based partitioning is supported when forceGuaranteedRollup = true in the tuningConfig.

Native batch (parallel)
● Parallel? Yes, if the firehose is splittable and maxNumConcurrentSubTasks > 1 in the tuningConfig. See the firehose documentation for details.
● Can append or overwrite? Yes, both.
● File formats: text file formats (CSV, TSV, JSON).
● Rollup modes: perfect if forceGuaranteedRollup = true in the tuningConfig.
● Partitioning options: hash-based partitioning (when forceGuaranteedRollup = true).

Hadoop-based
● Parallel? Yes, always.
● Can append or overwrite? Overwrite only.
● File formats: any Hadoop InputFormat.
● Rollup modes: always perfect.
● Partitioning options: hash-based or range-based partitioning via partitionsSpec.
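A parallel native batch task spec combining these settings might look like the following sketch; the datasource name, columns, bucket path, and shard count are illustrative placeholders:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "ssb_data",
      "timestampSpec": { "column": "lo_orderdate", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["c_nation", "s_nation", "p_category"] },
      "granularitySpec": { "segmentGranularity": "MONTH", "rollup": true }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": { "type": "static-s3", "uris": ["s3://my-bucket/ssb/denormalized.csv"] }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "forceGuaranteedRollup": true,
      "partitionsSpec": { "type": "hashed", "numShards": 8 }
    }
  }
}
```

Note that forceGuaranteedRollup = true requires a hash-based partitionsSpec, per the table above.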
Is Druid Right For My Project?
Data Characteristics
● Timestamp dimension
● Streaming
● Denormalized
● Many attributes (30+ dimensions)
● High cardinality

Use Case Characteristics
● Large dataset
● Fast query response (<1s)
● Low latency data ingestion
● Interactive, ad-hoc queries
● Arbitrary slicing and dicing (OLAP)
● Query real-time & historical data
● Infrequent updates
Long Term Benchmark Plan
● Loosely follow the enterprise digital transformation journey
● Using widely accepted benchmarks, characterize query
performance on batched data
● Using widely accepted data sets and benchmarks, characterize
streaming data ingestion and query performance
● Fully characterize ingestion with respect to timing and storage
● Develop the Streaming OLAP Benchmark the world needs
Druid and Data Warehouses
● Druid is not a DW
● Druid augments DW to provide the following
○ consistent, sub-second SLA
○ pre-aggregation/metrics generation upon ingest
○ simple schema
○ high concurrency reads
● Hot and warm queries in Druid, cold queries in DW
● Druid for internal and external customers, powering realtime
visualization
● DW for internal customers
Confidential. Do not redistribute.
Realtime DW Solution Architecture
(Architecture diagram: events from apps, storage, machines, and managed/unmanaged data centers flow through a stream > parse > search > detect > correlate pipeline into a custom dashboard, with notify, ETL, and ML outputs and block/control/permit/allow/prohibit/custom actions.)
Logical Test Architecture
Star Schema Benchmark
● Designed to evaluate database system performance of star
schema data warehouse queries
● Based on TPC-H
● Widely used since 2007
● Combines standard generated test data with 13 SQL queries
● https://www.cs.umb.edu/~poneil/StarSchemaB.PDF
Star Schema Benchmark Data Generation
● DBGEN utility
● Generates:
○ Fact table: lineorder.tbl
○ Dimension tables: customer.tbl, part.tbl, supplier.tbl, date.tbl
● Scale Factor (SF) determines data volume; here it was set to
generate 600 million rows, or roughly 100GB
SSB ETL and Ingestion
● TBL files are tab delimited
● Generate on EBS, store on S3
● Amazon Athena (Hive-compatible DDL) used to denormalize the 5 files into
one
● Saved in ORC and Parquet formats for flexibility (ORC tested in
Druid)
How data is structured
● Druid stores data in immutable segments
● Column-oriented compressed format
● Dictionary-encoded at column level
● Bitmap index compression: Concise & Roaring
○ Roaring is typically recommended; faster for boolean operations such
as filters
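The bitmap codec is selected in the indexSpec of a task's tuningConfig. A fragment, assuming a native batch task (compression codecs shown are the defaults):

```json
"tuningConfig": {
  "type": "index_parallel",
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4"
  }
}
```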
● Rollup (partial aggregation)
Optimize segment size
Ideally 300-700 MB (~5 million rows)
To control segment size
● Alter segment granularity
● Specify partition spec
● Use Automatic Compaction
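Automatic compaction, for instance, is enabled per datasource through the Coordinator's compaction config API. A sketch, with an illustrative datasource name:

```json
{
  "dataSource": "ssb_data",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
  }
}
```

POSTed to the Coordinator at /druid/coordinator/v1/config/compaction, this lets Druid merge undersized segments in the background.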
Controlling Segment Size
● Segment Granularity: increase if there is only 1 file per segment and it is < 200MB
"segmentGranularity": "HOUR"
● Max Rows Per Segment: increase if a single segment is < 200MB
"maxRowsPerSegment": 5000000
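Both knobs live in the ingestion spec; the fragment below shows where each setting goes, assuming a native batch task:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "NONE",
  "rollup": true
},
"tuningConfig": {
  "type": "index_parallel",
  "maxRowsPerSegment": 5000000
}
```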
Partitioning beyond time
● Druid always partitions by time
● Decide which dimension to partition on next
● Partition by some dimension you often filter on
● Improves locality, compression, storage size, and query performance
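In a native batch tuningConfig, secondary partitioning on a frequently filtered dimension can be sketched as follows (d_yearmonth is used here because the benchmark dataset was later partitioned on it):

```json
"partitionsSpec": {
  "type": "single_dim",
  "partitionDimension": "d_yearmonth",
  "targetRowsPerSegment": 5000000
}
```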
Ingestion (and the 5 million rows)
Run Rules
We ran JMeter against each platform’s HTTP API under the following conditions:
● Query cache off
● Each SSB query was run 10 times (10 samples per query)
● Each query flight consisted of all 13 SSB queries run in succession
● For each test, Average Response Time, Lowest Response Time, Highest Response Time, and Average Response Time Standard Deviation per query were calculated
● Each test was repeated five times
● The lowest and highest test results were discarded, a standard practice to remove outliers from performance testing results, leaving results from 3 test runs
● The remaining 3 results for each query were averaged to produce the final Average Response Time, Lowest Response Time, Highest Response Time, and Average Response Time Standard Deviation per query
Star Schema Benchmark Queries
● Designed around classic DW use cases
● Select from table exactly once
● Restrictions on dimensions
● Druid supports native and SQL queries
13 Queries in Plain English
Query Flight 1 has restrictions on 1 dimension and measures the revenue increase from eliminating ranges of discounts in given product order quantity intervals shipped in a given year.
Q1.1 has restrictions d_year = 1993, lo_quantity < 25, and lo_discount between 1 and 3.
Q1.2 changes the restrictions of Q1.1 to d_yearmonthnum = 199401, lo_quantity between 26 and 35, and lo_discount between 4 and 6.
Q1.3 changes the restrictions to d_weeknuminyear = 6 and d_year = 1994, lo_quantity between 36 and 40, and lo_discount between 5 and 7.
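In SQL, Q1.1 from the SSB paper reads:

```sql
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder, date
WHERE lo_orderdate = d_datekey
  AND d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25;
```

The other queries in the flight keep this shape and only swap the predicate constants.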
Query Flight 2 has restrictions on 2 dimensions. The query compares revenues for certain product classes and suppliers in a certain region, grouped by more restrictive product classes and all years of orders.
Q2.1 has restrictions on p_category and s_region.
Q2.2 changes the restrictions of Q2.1 to p_brand1 between 'MFGR#2221' and 'MFGR#2228' and s_region to 'ASIA'.
Q2.3 changes the restrictions to p_brand1 = 'MFGR#2339' and s_region = 'EUROPE'.
Query Flight 3 has restrictions on 3 dimensions. The query retrieves total revenue for lineorder transactions within a given region in a certain time period, grouped by customer nation, supplier nation, and year.
Q3.1 has restrictions c_region = 'ASIA' and s_region = 'ASIA', restricts d_year to a 6-year period, and groups by c_nation, s_nation, and d_year.
Q3.2 changes the region restrictions to c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES', grouping revenue by customer city, supplier city, and year.
Q3.3 changes the restrictions on c_city and s_city to two cities in 'UNITED KINGDOM' and retrieves revenue grouped by c_city, s_city, and d_year.
Q3.4 changes the date restriction to a single month. After partitioning the 12 billion row dataset on d_yearmonth, we needed to rewrite the query for d_yearmonthnum.
Query Flight 4 provides a "what-if" sequence of queries that might be generated in an OLAP style of exploration. Starting with a query with rather weak constraints on three dimension columns, we retrieve aggregate profit, sum(lo_revenue - lo_supplycost), grouped by d_year and c_nation. Successive queries modify predicate constraints by drilling down to find the source of an anomaly.
Q4.1 restricts c_region and s_region both to 'AMERICA', and p_mfgr to one of two possibilities.
Q4.2 follows a typical workflow to dig deeper into the results. We pivot away from grouping by s_nation, restrict d_year to 1997 and 1998, and drill down to group by p_category to see where the profit change arises.
Q4.3 digs deeper, restricting s_nation to 'UNITED STATES' and p_category = 'MFGR#14', drilling down to group by s_city (in the USA) and p_brand1 (within p_category 'MFGR#14').
Query Optimization
● Date! Date! Date! The biggest impacts in optimization came from aligning dates as ingested with anticipated queries.
● Optimize SQL expressions
● Vectorize
Query 4.3 at each optimization stage:

SSB (Original):
select d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
from denormalized
where s_nation = 'UNITED STATES' and (d_year = 1997 or d_year = 1998) and p_category = 'MFGR#14'
group by d_year, s_city, p_brand1
order by d_year, s_city, p_brand1

Apache Druid:
select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
from ${jmDataSource}
where c_region = 'AMERICA' and s_region = 'AMERICA'
  and (FLOOR("__time" to YEAR) = TIME_PARSE('1997-01-01T00:00:00.000Z')
    or FLOOR("__time" to YEAR) = TIME_PARSE('1998-01-01T00:00:00.000Z'))
  and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, s_nation, p_category
order by d_year, s_nation, p_category
Explain Plan
EXPLAIN PLAN FOR
SELECT d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
FROM ssb_data
WHERE s_nation = 'UNITED STATES' and (d_year = 1997 or d_year = 1998) and p_category = 'MFGR#14'
GROUP BY d_year, s_city, p_brand1
ORDER BY d_year, s_city, p_brand1
JMeter Config
JMeter Queries
Apache Druid SSB Results
Now Go Do It Yourself!
● Spec out your test project thoroughly
● Representative Data
● Representative Queries
● Install a small cluster (Quickstart)
● Ingest and tune
● Query via console for functional testing
● Install JMeter (on query server and locally)
● Run queries against the HTTP API (no GUI, query server)
● Change, rerun, measure differences and learn
● The best way to learn is to just do it!
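Running queries against the HTTP API means POSTing a JSON payload to the query server's SQL endpoint (/druid/v2/sql). A sketch, with a placeholder datasource name:

```json
{
  "query": "SELECT d_year, SUM(lo_revenue) AS revenue FROM ssb_data GROUP BY d_year",
  "resultFormat": "object"
}
```

This is the same request shape JMeter sends in the run rules above, with Content-Type: application/json.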
Resources
● druid.apache.org
● druid.apache.org/community
● ASF #druid Slack channel
● jmeter.apache.org
● https://www.cs.umb.edu/~poneil/StarSchemaB.PDF
● https://github.com/lemire/StarSchemaBenchmark
● https://github.com/implydata/benchmark-tools
