InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optimized In-Memory Query Execution Engine

Paul Dix
InfluxData – CTO & co-founder
paul@influxdata.com
@pauldix
InfluxDB IOx - a new columnar
time series database (update)

Progress
• New Team Members!
• Read Buffer progress
• Mutable Buffer & Read Buffer connections
• Arrow Flight API
• Replication, multiple IOx servers doc

API Decisions
• Management API will be gRPC
– CLI for common tasks
• Write
– InfluxDB 2.0 Line Protocol
– JSON objects (events!)
– Protobuf?
• Query
– HTTP (csv, json, display)
– Arrow Flight
– Postgres?

What’s Next?
• Management API
• Parquet Persistence to Object Store
• Recovery from Object Store
• Replication
• Subscriptions
• Official Builds & Documentation (now late March)

Edd Robinson
Engineer @ InfluxData
edd@influxdata.com
@e-dard 🐙
@eddrobinson 🐦
An Intro to the InfluxDB IOx
Read Buffer: a read-optimised
in-memory execution engine

Me
● Software engineer at InﬂuxData.
● Worked on InﬂuxDB for ~4y: storage engine, write path, indexing.
Working on IOx (and with Rust!) for just over a year.

What are we working towards?
● Unlimited Data:
○ Object Storage, compression
● Unlimited Cardinality:
○ Data organisation, no large
indexes.
● 🚀 Analytical Queries:
○ in-memory, columnar
data-layout, lots of fanciness

This talk is about...
A sub-system in IOx called the Read Buffer, a new query execution engine.
● Work on data held in-memory and on-heap. No IO at read-time
● Data is immutable.
● Lots of wholesome column-store goodness:
○ 📊
○ 🗜
○ ⇶
○ ❓
○ ❓

Wider Goals
We want to have excellent support for different time-series
use-cases
● Events
● Observability trifecta (logging, tracing, metrics)
● Large analytical workloads

We already have a time-series database?

Quick Refresher
●
●
●
●

InﬂuxDB Sad 🐼
~77 MB .
👎

●
●
● mmap -
●
IOx Bets

Why columnar is the way to go
● Analytical workloads usually only need
projections of dataset.
● Increase ﬂexibility in data organisation.
● Improve data relevance.
● Reduce footprint through compression.
● Mechanical sympathy - CPUs love arrays.
Forrest Smith - blog

Memory Bandwidth: benchmark
● This example is synthetic (but indicative!)
● Data throughput from memory to CPU has an
impact on performance.
● CPU cache is signiﬁcantly faster than main memory

L1 Cache
L2/L3 Cache
Main Memory
Memory Bandwidth: benchmark
● This example is synthetic (but indicative)!
● Data throughput from memory to CPU has an
impact on performance.
● CPU cache is signiﬁcantly faster than main memory
If you want to make the most use of your memory
bandwidth:
● process less data.
● process more relevant data.
Columnar representations help with both of these

🤿 Dive into the Read Buﬀer
● Data organisation;
● Data representation;
● Read execution (late materialisation);
● Early numbers!
● Future improvements.

● WAL: replication and recovery
● Mutable Buﬀer: query written data
● Object Store: for durability
● Read Buﬀer: optised read-only view
of written data.
IOx Write Path

IOx Read Path
Query Engine
SQL Frontend
Flux Frontend
InfluxQL Frontend
Mutable Buffer
Read Buffer
Object Storage
Reader

IOx Read Path
Query Engine
SQL Frontend
Flux Frontend
… Frontend
Mutable Buffer
Read Buﬀer
Object Storage
Reader

Data Model
Data organised by database

Data Model
Databases are collections of
partitions
Partition Key

Chunk ID
Data Model
Partitions contain chunks

Table name
Data Model
Chunks contain Tables

Data Model
Tables contain Row Groups
Same Schema
Filter entire tables

Data Model
Row Groups contain columnar data
Skip Row Group

Data Model
(thanks @alamb)
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
location
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
temperature
82
82
70
83
87
72
84
90
71
humidity
67
65
54
69
78
56
67
82
57
timestamp
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
Row Group in Table: weather

Supported Data Types
Logical Data Types
● String (utf-8 valid strings)
● Float (double-precision float)
(all of them 😉)
● Integer (signed integers)
● Unsigned (unsigned integers)
● Boolean
● Binary (arbitrary bytes)
Semantic Column Types
● InfluxDB Tag ➟ String
● InfluxDB Field ➟ Most
● InfluxDB Timestamp ➟ I64
● IOx Column ➟ Anything

Tailored for time-series:
● scans, grouped aggregates, windowed aggregates, schema
exploration (tables, columns, values).
● Table/row group pruning.
● Predicate pushdown.
● Comparator operators with constant on tag columns
(<, <=, >, >=, =, !=}
● Aggregates any column(s)
Interesting Supported Features

Storing Data in the Read Buﬀer
➡

Columnar Compression Spectrum
Lots ‘o Compression
💯 Smaller Footprint
👎 High processing cost
No Compression
👎 Larger footprint
💯 ~Zero processing cost

Columnar Compression Spectrum
Lots ‘o Compression
Smaller Footprint
High processing cost
No Compression
Larger footprint
~Zero processing cost
Vec<T>

Choice can depend on data location

And Medium $$$
Petabytes
$0.03/GB
Gigabytes
$10/GB??
Terabytes
$0.1/GB

Read Buﬀer Compression Schemes
Dictionary Encoding
● Good for high cardinality tag
columns.
● Column order not factor in
compression.
● Constant time access. 🚀
● Key: Operate directly on
compressed data. 🚀

Filtering Dictionary Encoding
WHERE “region” = ‘east’
x = 0
{0, 2, 7, 15}
WHERE “region” > ‘north’
x > 1
{1, 3, 5, 8, 9, 10,
11, 12, 14}

“RLE” - Run-Length Encoding
● Incredible compression when lots
of “runs”.
● Works best on heavily sorted
columns.
● Not as consumable*
● Pre-computed bitsets 🚀
● Can operate on compressed
data. 🚀

“RLE” - Run-Length Encoding
x = 0
WHERE “region” > ‘north’
x > 1
{9, 10, 11, 12, 13,
14, 15}

Which Dictionary Encoding?
● 10M rows in column
● Cardinality 10,000
● Single thread
Billions rows/second processed

● 10M rows in column.
● Cardinality 10,000.
● Single thread.
● SIMD intrinsics on Dictionary Encoding.
● RLE is on another level: “cheating”...
RLE
59ms 2.2ms 420ns
380MB ~40MB ~40MB

WHERE “span_id” = ‘123djk7GHs99wj’
● 10 million rows in column.
● Cardinality 10 million.
● Single thread.
● SIMD intrinsics on Dictionary Encoding.
RLE
60ms 2.2ms
380MB ~420MB
580ns
~1GB

“I need rows [2, 33, 55, 111, 3343]”
10,000,000 row column
Encoding Cardinality 10K
(materialise 1000 rows near end)
Cardinality 10M
(materialise 1 row near end)
Vec<String>
Dictionary μ
RLE μ

●
● ﬁltering
●
materialisation

Numerical Column Encodings
Supported Logical types: i64, u64, f64
{u8, i8,.., u64, i64}*
&[i64]: (48 B) [123, 198, 1, 33, 133, 224] ➠ &[u8]: (6 B) [..]
&[i64]: (48 B) [-18, 2, 0, 220, 2, 26] ➠ &[i16]: (12 B) [..]

Numerical Column Encodings
●
●
●
●

Read Execution
SELECT “host”, “counter”, “time”
FROM “cpu”
WHERE “env” = ‘prod’ AND
“path” = ‘/write’ AND
“counter” > 200 AND
“time” >= x AND “time” < y;
●
●
●
●

Late Materialisation - Scanning
SELECT “host”, “counter”, “time” FROM “cpu” WHERE “env” = ‘prod’ AND “path” = ‘/write’ AND “counter” > 200 AND “time” >= x AND “time” < y;

Late Materialisation - Grouping
SELECT SUM(“counter”) FROM “cpu” WHERE “path” = ‘/query’ AND “time” >= x AND “time” < y GROUP BY “region”;
♥

Let’s look at some initial numbers

●
●
span_id
●
●
●
Synthetic High Cardinality Tracing use-case
Column Name Cardinality Encoding

How much space do we need?
●
●
●

1 M 1 ms 1.2 ms
10 M 1.1 ms 2.5 ms
60 M 1.3 ms 15.7 ms
SELECT * FROM “traces” WHERE “trace_id” = ‘H7whivfl’;
●
● 🤔
● 💪
●
“Needle in a Haystack”

SELECT SUM(duration) FROM “traces” GROUP BY “trace_id”;
●
●
●
Aggregating over high-cardinality
1 M 30 s
(~10 GB RAM)
45 ms
(8 MB)
10 M 18 min
(140 GB RAM)
498 ms
(150 MB)
60 M D.N.F
(OOM)
4.3 s
(900MB)

SHOW TAG KEYS WHERE “cluster” = ‘cluster-2-2-3’
AND time >= x AND time < y ;
Schema Exploration
1 M 15 ms 12 μs
10 M 150 ms 47 μs
60 M 1.6 s 120 μs

Future Work
Lots more to do in Read Buffer land!
● Data-type support.
● More supported predicate, e.g., regex, LIKE, OR.
● More columnar encodings (e.g., time-series specific field encodings)
● Deletes support! (Proposal written up)
● Complete implementation of all physical operations.
● Performance - predicate caching, buffer pooling etc.
● Concurrent execution.

InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optimized In-Memory Query Execution Engine

More Related Content

What's hot

Similar to InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optimized In-Memory Query Execution Engine

More from InfluxData

Recently uploaded

InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optimized In-Memory Query Execution Engine