InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx

Query Processing in
InfluxDB IOx
SQL, Storge gRPC, Reorganization
2021-10-13
CC BY-SA
Andrew Lamb

© 2021 InﬂuxData. All rights reserved.
2
Today: IOx Team at InfluxData
Past life 1: Query Optimizer @
Vertica, also on Oracle DB server
Past life 2: Chief Architect + VP
Engineering roles at some ML startups

3
Talk Outline
‒ Data Model and Storage Review
‒ Query Processing Overview
‒ Frontends
‒ Execution Plans

4
Data Layout and
Storage

5
Data Organization: Partitions
Partitions define how data is kept
in separate Chunks in storage.
Each chunk logically stores part of
a partition for one or more
Relational Tables
Partitioning is used for for
1. Data Lifecycle Management
(drop whole partitions by deleting files)
2. Query Performance
(partition pruning)
Each row mapped to a single
Partition based on Partition Rules
cpu table disk table requests table
Jan 1 Jan 2
Jan 3
Jan 1
Jan 3
Jan 1 Jan 2

6
Data Organization: Chunks
Chunk0
(closed)
Chunk1
(closed)
Chunk4
open
Within each partition within a table,
data is divided into physical
chunks, identified with a chunk id
and a chunk order.
Chunks with lower order have older
(by insert time) data.
There is at most one open chunk
for each partition. All new data
(including deletes + updates) is
written into the open chunk
Once a chunk is closed, it becomes
immutable: rows are never
added/removed. The data is
compacted / persisted over time
into new chunks and the old chunk
dropped
Chunk2
(closed)
Chunk3
(closed)
New data is written into
the open chunk
Closed chunks are
ordered by age of data,
and never modified

7
Data Model
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
location
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
temperature
82
82
70
83
87
72
84
90
71
humidity
67
65
54
69
78
56
67
82
57
timestamp
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z

8
Query Processing

9
Design: One Execution Engine
1. Query and Data Reorganization*: two sides of the same coin “moving data around”
2. Reuse as much existing execution machinery (e.g. streaming, segregated worker pool, etc)
3. Amplify investment by leveraging Open Source (and contribute back)
⇒ All queries run through a unified planning system based
on DataFusion + Arrow
Execute as Rust async streams (`RecordBatchStream`)
using tokio executor
* Putting data in physical structures (Chunks)

10
Query Processing IOx
Storage gRPC
Frontend
SQL Frontend
(from DataFusion)
Optimization
(storage pruning, pushdown,
etc)
Physical Planning
Execution
gRPC output Arrow Flight IPC
Query Input
Client / Language Specific
Frontends
Shared Planning, Execution
Phases, based on DataFusion
Client Specific
Output formats
read_group(.
.)
SELECT … FROM
...
DataFusio
n
LogicalPla
n
Arrow
Record
Batches
Reorg Frontend
compact_plan(..
)
ReadBuffer /
ParquetWriter
SeriesFrame
...
FlightData
Write to
ReadBufer or
Parquet files
DataFusio
n
LogicalPla
n
DataFusion
Physical
Plan

11
IOx Query Optimization Features
“Classic”: Projection/Filter/Limit pushdown, partial eval, ...
Predicate Evaluation During Scan
Chunk Pruning on predicates
Parquet Row Group Pruning
Grouping/Aggregate Pushdown
Filters pushed down on some
metadata queries, scans in
ReadBuffer
DataFusion IOx
N/A
ReadBuffer has support, but no
query engine support

12
Front Ends
(Logical Plans)

13
SQL Frontend
Arrow Flight
client
IOx
Port 8082
Object Store
2. IOx answers
queries by
combining data
from parquet files
+ in memory
cache
1. Flight request sent
3. Response
streamed back
via flight RPC
See DataFusion: An Embeddable Query Engine Written in Rust for more details

14
SQL: Logical Plan
SELECT cpu, usage_user, time
FROM cpu
WHERE cpu = 'cpu1';
TableScan is accomplished via
IOxReadFilterNode
.
Chunks are presented to as DataFusion
“partitions” (different than IOx partitions)
IOx query engine handles resolving upserts and
deletes
results
Filter:
#cpu Eq Utf8(“cpu1”)
TableScan: cpu
projection= Some([0,1,2,12])
Projection:
#cpu, #usage_user, #time

15
Reorg / Life Cycle Planner
Chunk2
Chunk1
Chunk3
Compact Plan
resolves upserts /
applies deletes
RecordBatch stream
Compact lifecycle operation
writes Stream to new Read
Buffer (RUB) or Parquet chunk
Chunk2
Chunk1
Chunk3
Split Plan:
resolves upserts /
applies deletes
RecordBatch stream
Persist lifecycle operation writes
streams to Parquet chunk and RUB
respectively
RecordBatch stream
time <= split_time
time > split_time
Compact
Split (split_time)

16
Reorg / Life Cycle Planner
Compact Plan
TableScan: cpu
Chunks = ...
Split Plan
TableScan: cpu
Chunks = ...
(2 DataFusion partitions)
StreamSplit split_time=1004
DataFusion
extension Node

17
Storage gRPC frontend: Flux and InfluxQL
Flux
Runtime
InfluxQL
IOx
Port 8082
Object Store
2. IOx answers
queries by
combining data
from parquet files
+ in memory
cache
1. Flux/InfluxQL
send requests
via gRPC
3. Response
streamed back
via gRPC

18
Storage gRPC Operations
‒ ReadFilter: Scan data out of IOx matching predicates
‒ ReadGroup: Groups/aggregates in IOx returning grouped data
‒ ReadWindowAggregate: Similar to ReadGroup, but windowed by time
‒ TagKeys: tag keys (column names) with data matching predicates.
‒ TagValues: distinct tag values (column values) with data matching predicates for a set
of tag keys (columns).
‒ MeasurementNames: table names that satisfy some provided predicate.
‒ MeasurementTagKeys: Same as TagKeys but limited to a single table.
‒ MeasurementTagValues: Same as TagValues but limited to a single table.
‒ MeasurementFields: field names (column names) with rows matching predicate
DataQuery Returns Time Series MetadataQuery Returns Strings / times
(thanks @e-dard)

19
Metadata Queries
meta data queries are incredibly common and often done on more recent data
‒ measurement_names(range, predicate)
‒ tag_keys(range, predicate)
‒ tag_values(tag_key, range, predicate)
Metadata Query
Fast path for
predicates
?
* Read Buffer (RUB) in particular is heavily optimized for metadata queries and rarely need general purpose plans.
YES: Answer with optimized*
implementation in chunk.
NO: Run general purpose
(DataFusion) plan

20
tag_keys (general)
tag_keys
pred: cpu ~= ‘.*total’
ts_range:[1000, 2000]
results
Filter:
#cpu =~ ‘.*total’ AND 1000 < #time
AND #time > 2000
TableScan: cpu
SchemaPivot
DataFusion extension Node
Produces a single output
String column with the
name of any input column
that had a non null value

21
Handling multiple tables
tag_keys
pred: host = ‘foo’
ts_range:[1000, 2000]
Filter:
1000 < #time AND
#time > 2000 AND host
= ‘foo’
TableScan: cpu
SchemaPivot
SeriesSetPlan for cpu
SeriesSetPlan(LogicalPlan)
Filter:
1000 < #time AND
#time > 2000 AND host
= ‘foo’
TableScan: mem
SchemaPivot
SeriesSetPlan for mem
{}
SeriesSetPlan for host
(no data between 1000 and
2000)
Results from multiple plans and sets are
combined at higher level

22
read_filter
pred: tag(_m)=”system” AND tag(_f)=”usage_user” AND tag(cpu)=”cpu1”
ts_range: [1000, 2000)
read_filter: Logical Plan
IOx code creates DataFusion
LogicalPlan nodes
Filter:
1000 < #time AND
#time > 2000 AND
Sort: (#cpu, #host, #time)
TableScan: system
Projection:
#cpu, #host, #usage_user, #time
TableScan is accomplished via same
IOxReadFilterNode
.
Predicates are applied using a Filter
Sort data in tag key order, as expected by Flux /
InfluxQL

23
read_filter:
Physical Plan
FilterExec:
1000 < #time AND
#time > 2000 AND
Sort: (#cpu, #host, #time)
IOxReadFilterNode
ProjectionExec:
#cpu, #host, #usage_user, #time
CoalescePartitionsExec
CoalescePartitionsExec does
not preserve sort order
Added by DataFusion physical
planning due to requirements from
Sort
FilterExec:
1000 < #time AND
#time > 2000 AND
IOxReadFilterNode
FilterExec:
1000 < #time AND
#time > 2000 AND
IOxReadFilterNode:
….
Repeated for
each chunk
PartitionChunk
(mutable_buffer)
PartitionChunk
(read_buffer)
PartitionChunk
(read_buffer)
Calls
PartitionChunk::read_filter
During execution to get
results

24
read_group
read_group
pred: tag(_m)=”cpu”
agg: first
group_keys: “env”
ts_range: [1000, 2000)
Assumes env and cpu are tags
results
Filter:
1000 < #time AND #time > 2000
Sort:
env, cpu, _time, _value
TableScan: cpu
Projection:
env, cpu, _time, _value
GroupBy:
gby: env, cpu
agg: first.time(usage_user, time) as _time
first.value(usage_user, time) as _value

25
Execution Plans

26
Table Scan: Reading data from a Chunk
TableScan: cpu
IOxReadFilterNode
chunk_id = 1
LogicalPlan ExecutionPlan SendableRecordBatchStream
SchemaAdapterStream
{MUB,RUB,Parquet}Stream
For a single chunk
* Chunk that has no deletes or possible
updates

27
Schema Adapter Stream
SchemaAdapterStream
output_schema: {A, B, C}
A C
1 10
2 20
3 30
4 40
Input RecordBatch
A B C
1 NULL 10
2 NULL 20
3 NULL 30
4 NULL 40
Output RecordBatch
Missing columns are padded with
nulls

28
Read Time Resolution of Updates/Upserts
Chunks that potentially
have updates (overlaps)
to primary keys but
different sort orders
TableScan: cpu
IOxReadFilterNode
chunk_id = 1
LogicalPlan
ExecutionPlan
Simplified -- real
plans handle partial
overlap scenarios;
See source
documentation for
more details
IOxReadFilterNode
chunk_id = 7
UnionExec
SortPreservingMerge
DeduplicateExec
IOx extension that implements tag key
deduplication
SortExec(optional)
Sort_key: tags
SortExec(optional)
Sort_key: tags
May have to resort on
primary key columns
Classic N-way merge
Doesn’t combine any DF partitions

29
Read Time Resolution of Deletes
IOxReadFilterNode
chunk_id = 1
ExecutionPlan
IOxReadFilterNode
chunk_id = 7
UnionExec
SortPreservingMerge
DeduplicateExec
SortExec(optional)
Sort_key: tags
SortExec(optional)
Sort_key: tags
Deletes can vary across chunks
Any delete predicates are also
applied in these scans (and thus
pushed down to MUB, RUB, etc)
as normal
FilterExec
time < 2021-10-01
Delete where
time < 2021-10-01

InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx

Similar to InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx (20)

More from InfluxData

More from InfluxData (20)

Recently uploaded

Recently uploaded (20)

InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx