© 2017 InfluxData. All rights reserved.1
Inside the InfluxDB Storage
Engine
Gianluca Arbezzano
gianluca@influxdb.com
@gianarb
© 2017 InfluxData. All rights reserved.2
© 2017 InfluxData. All rights reserved.3
What is time series data?
© 2017 InfluxData. All rights reserved.4
Stock trades and quotes
© 2017 InfluxData. All rights reserved.5
Metrics
© 2017 InfluxData. All rights reserved.6
Analytics
© 2017 InfluxData. All rights reserved.7
Events
© 2017 InfluxData. All rights reserved.8
Sensor data
Traces
© 2017 InfluxData. All rights reserved.10
Two kinds of time series
data…
© 2017 InfluxData. All rights reserved.11
Regular time series
t0 t1 t2 t3 t4 t6 t7
Samples at regular intervals
© 2017 InfluxData. All rights reserved.12
Irregular time series
t0 t1 t2 t3 t4 t6 t7
Events whenever they come in
© 2017 InfluxData. All rights reserved.13
Why would you want a
database for time series
data?
© 2017 InfluxData. All rights reserved.14
Scale
© 2017 InfluxData. All rights reserved.15
Example from server monitoring
• 2,000 servers, VMs, containers, or sensor units
• 1,000 measurements per server/unit
• every 10 seconds
• = 17,280,000,000 distinct points per day
© 2017 InfluxData. All rights reserved.16
Compression
© 2017 InfluxData. All rights reserved.17
Aging out data
© 2017 InfluxData. All rights reserved.18
Downsampling
© 2017 InfluxData. All rights reserved.19
Fast range queries
Two Databases…
© 2017 InfluxData. All rights reserved.21
TSDB
© 2017 InfluxData. All rights reserved.22
Inverted Index
preliminary intro materials…
© 2017 InfluxData. All rights reserved.24
Everything is indexed by time and
series
© 2017 InfluxData. All rights reserved.25
Shards
10/11/2015 10/12/2015
Data organized into Shards of time, each is an underlying DB
efficient to drop old data
10/13/201510/10/2015
© 2017 InfluxData. All rights reserved.26
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
© 2017 InfluxData. All rights reserved.27
InfluxDB data
Measurement
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
© 2017 InfluxData. All rights reserved.28
InfluxDB data
Measurement Tags
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
© 2017 InfluxData. All rights reserved.29
InfluxDB data
Measurement Tags Fields
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
© 2017 InfluxData. All rights reserved.30
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Tags
(tagset all
together)
Fields Timestamp
© 2017 InfluxData. All rights reserved.31
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement Fields Timestamp
We actually store up to ns scale timestamps
but I couldn’t fit on the slide
Tags
(tagset all
together)
© 2017 InfluxData. All rights reserved.32
Each series and field to a unique ID
temperature,device=dev1,building=b1#internal
temperature,device=dev1,building=b1#external
1
2
© 2017 InfluxData. All rights reserved.33
Data per ID is tuples ordered by time
temperature,device=dev1,building=b1#internal
temperature,device=dev1,building=b1#external
1
2
1 (1443782126,80)
2 (1443782126,18)
© 2017 InfluxData. All rights reserved.34
Arranging in Key/Value Stores
1,1443782126
Key Value
80
ID Time
© 2017 InfluxData. All rights reserved.35
Arranging in Key/Value Stores
1,1443782126
Key Value
80
2,1443782126 18
© 2017 InfluxData. All rights reserved.36
Arranging in Key/Value Stores
1,1443782126
Key Value
80
2,1443782126 18
1,1443782127 81 new data
© 2017 InfluxData. All rights reserved.37
Arranging in Key/Value Stores
1,1443782126
Key Value
80
2,1443782126 18
1,1443782127 81
key space
is ordered
© 2017 InfluxData. All rights reserved.38
Arranging in Key/Value Stores
1,1443782126
Key Value
80
2,1443782126 18
1,1443782127 81
2,1443782256 15
2,1443782130 17
3,1443700126 18
Many existing
storage engines
have this model
© 2017 InfluxData. All rights reserved.40
New Storage Engine?!
© 2017 InfluxData. All rights reserved.41
First we used LSM Trees
© 2017 InfluxData. All rights reserved.42
deletes expensive
© 2017 InfluxData. All rights reserved.43
too many open file handles
© 2017 InfluxData. All rights reserved.44
Then mmap COW B+Trees
© 2017 InfluxData. All rights reserved.45
write throughput
© 2017 InfluxData. All rights reserved.46
compression
© 2017 InfluxData. All rights reserved.47
met our requirements
© 2017 InfluxData. All rights reserved.48
High write throughput
© 2017 InfluxData. All rights reserved.49
Awesome read performance
© 2017 InfluxData. All rights reserved.50
Better Compression
© 2017 InfluxData. All rights reserved.51
Writes can’t block reads
© 2017 InfluxData. All rights reserved.52
Reads can’t block writes
© 2017 InfluxData. All rights reserved.53
Write multiple ranges
simultaneously
Hot backups
© 2017 InfluxData. All rights reserved.55
Many databases open in a single
process
© 2017 InfluxData. All rights reserved.56
Enter InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
© 2017 InfluxData. All rights reserved.57
Enter InfluxDB’s
Time Structured Merge Tree
(TSM Tree)
like LSM, but different
© 2017 InfluxData. All rights reserved.58
Components
WAL
In
memory
cache
Index
Files
© 2017 InfluxData. All rights reserved.59
Components
WAL
In
memory
cache
Index
Files
Similar to LSM
Trees
© 2017 InfluxData. All rights reserved.60
Components
WAL
In
memory
cache
Index
Files
Similar to LSM
Trees
Same
© 2017 InfluxData. All rights reserved.61
Components
WAL
In
memory
cache
Index
Files
Similar to LSM
Trees
Same
like
MemTables
© 2017 InfluxData. All rights reserved.62
Components
WAL
In
memory
cache
Index
Files
Similar to LSM
Trees
Same
like
MemTables
like SSTables
© 2017 InfluxData. All rights reserved.63
awesome time series data
WAL (an append only file)
© 2017 InfluxData. All rights reserved.64
awesome time series data
WAL (an append only file)
in memory index
© 2017 InfluxData. All rights reserved.65
awesome time series data
WAL (an append only file)
in memory index
on disk index
(periodic flushes)
© 2017 InfluxData. All rights reserved.66
awesome time series data
WAL (an append only file)
in memory index
on disk index
(periodic flushes)
Memory
mapped!
© 2017 InfluxData. All rights reserved.67
TSM File
© 2017 InfluxData. All rights reserved.68
TSM File
© 2017 InfluxData. All rights reserved.69
TSM File
© 2017 InfluxData. All rights reserved.70
TSM File
© 2017 InfluxData. All rights reserved.71
TSM File
© 2017 InfluxData. All rights reserved.72
TSM File
© 2017 InfluxData. All rights reserved.73
Compression
© 2017 InfluxData. All rights reserved.74
Timestamps: encoding based on
precision and deltas
© 2017 InfluxData. All rights reserved.75
Timestamps (best case):
Run length encoding
Deltas are all the same for a block
© 2017 InfluxData. All rights reserved.76
Timestamps (good case):
Simple8B
Ann and Moffat in "Index compression using 64-bit words"
© 2017 InfluxData. All rights reserved.77
Timestamps (worst case):
raw values
nano-second timestamps with large deltas
© 2017 InfluxData. All rights reserved.78
float64: double delta
Facebook’s Gorilla - google: gorilla time series facebook
https://github.com/dgryski/go-tsz
© 2017 InfluxData. All rights reserved.79
booleans are bits!
© 2017 InfluxData. All rights reserved.80
int64 uses double delta, zig-zag
zig-zag same as from Protobufs
© 2017 InfluxData. All rights reserved.81
string uses Snappy
same compression LevelDB uses
(might add dictionary compression)
© 2017 InfluxData. All rights reserved.82
Updates
Write, resolve at query
© 2017 InfluxData. All rights reserved.83
Deletes
tombstone, resolve at query & compaction
© 2017 InfluxData. All rights reserved.84
Compactions
• Combine multiple TSM files
• Put all series points into same file
• Series points in 1k blocks
• Multiple levels
• Full compaction when cold for writes
© 2017 InfluxData. All rights reserved.85
Example Query
select percentile(90, value) from cpu
where time > now() - 12h and “region” = ‘west’
group by time(10m), host
© 2017 InfluxData. All rights reserved.86
Example Query
select percentile(90, value) from cpu
where time > now() - 12h and “region” = ‘west’
group by time(10m), host
How to map to
series?
© 2017 InfluxData. All rights reserved.87
Inverted Index!
© 2017 InfluxData. All rights reserved.88
Inverted Index
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
series to ID
© 2017 InfluxData. All rights reserved.89
Inverted Index
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
cpu -> [idle] measurement to fields
series to ID
© 2017 InfluxData. All rights reserved.90
Inverted Index
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
cpu -> [idle]
host -> [A, B]
measurement to fields
host to values
series to ID
© 2017 InfluxData. All rights reserved.91
Inverted Index
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
cpu -> [idle]
host -> [A, B]
region -> [west]
measurement to fields
host to values
region to values
series to ID
© 2017 InfluxData. All rights reserved.92
Inverted Index
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
cpu -> [idle]
host -> [A, B]
region -> [west]
cpu -> [1, 2]
host=A -> [1]
host=B -> [1]
region=west -> [1, 2]
measurement to fields
host to values
region to values
series to ID
postings lists
© 2017 InfluxData. All rights reserved.93
Index V1
• In-memory
• Load on boot
• Memory constrained
• Slower boot times with high cardinality
© 2017 InfluxData. All rights reserved.94
Index V2
© 2017 InfluxData. All rights reserved.95
in memory index on disk index (do we already have?)
time series meta data
© 2017 InfluxData. All rights reserved.96
in memory index on disk index (do we already have?)
time series meta data
nope
WAL (an append only file)
© 2017 InfluxData. All rights reserved.97
in memory index on disk index (do we already have?)
time series meta data
nope
WAL (an append only file)
on disk indices
(periodic flushes)
© 2017 InfluxData. All rights reserved.98
in memory index on disk index (do we already have?)
time series meta data
nope
WAL (an append only file)
on disk indices
(periodic flushes)
(compactions)
on disk index
© 2017 InfluxData. All rights reserved.99
© 2017 InfluxData. All rights reserved.100
Index File Layout
© 2017 InfluxData. All rights reserved.101
© 2017 InfluxData. All rights reserved.102
© 2017 InfluxData. All rights reserved.103
Example Key Exists Lookup
[ 76, 234, 129, 352 ] File locations
© 2017 InfluxData. All rights reserved.104
[ 76, 234, 129, 352 ]
cpu,host=serverA,region=west#idle
© 2017 InfluxData. All rights reserved.105
[ 76, 234, 129, 352 ]
cpu,host=serverA,region=west#idle
© 2017 InfluxData. All rights reserved.106
Robin Hood Hashing
• Can fully load table
• No linked lists for lookup
• Perfect for read-only hashes
© 2017 InfluxData. All rights reserved.107
[ , , , , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
© 2017 InfluxData. All rights reserved.108
[ , , , , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
A ->
0
© 2017 InfluxData. All rights reserved.109
[ A, , , , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
A ->
0
© 2017 InfluxData. All rights reserved.110
[ A, , , , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
B ->
1
© 2017 InfluxData. All rights reserved.111
[ A, B, , , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
B ->
1
© 2017 InfluxData. All rights reserved.112
[ A, B, , , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
C ->
1
© 2017 InfluxData. All rights reserved.113
[ A, B, C, , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 0, 0, 0 ]
Keys
Probe Lengths
C ->
2
© 2017 InfluxData. All rights reserved.114
[ A, B, C, , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 1, 0, 0 ]
Keys
Probe Lengths
C -> probe
1
© 2017 InfluxData. All rights reserved.115
[ A, B, C, , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 1, 0, 0 ]
Keys
Probe Lengths
D ->
0
© 2017 InfluxData. All rights reserved.116
[ A, B, C, , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 0, 1, 0, 0 ]
Keys
Probe Lengths
D -> probe
1
© 2017 InfluxData. All rights reserved.117
[ A, D, C, , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 0, 0 ]
Keys
Probe Lengths
B -> probe
1
© 2017 InfluxData. All rights reserved.118
[ A, D, C, , ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 0, 0 ]
Keys
Probe Lengths
B -> probe
2
© 2017 InfluxData. All rights reserved.119
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe Lengths
B -> probe
2
© 2017 InfluxData. All rights reserved.120
Rob probe rich, give to probe
poor
© 2017 InfluxData. All rights reserved.121
Refinement: average probe
© 2017 InfluxData. All rights reserved.122
Cache Hit
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe LengthsAverage: 1
© 2017 InfluxData. All rights reserved.123
Cache Hit
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe LengthsAverage: 1
D -> hashes to 0 +
1
© 2017 InfluxData. All rights reserved.124
Cache Miss
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe Lengths
Z -> hashes to
0
© 2017 InfluxData. All rights reserved.125
Cache Miss
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe Lengths
Z -> move probe
1
© 2017 InfluxData. All rights reserved.126
Cache Miss
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe Lengths
Z -> move probe
2
© 2017 InfluxData. All rights reserved.127
Cache Miss
[ A, D, C, B, ]
[ 0, 1, 2, 3, 4 ] Positions
[ 0, 1, 1, 2, 0 ]
Keys
Probe Lengths
Max Probe 2, so Z not
present
© 2017 InfluxData. All rights reserved.128
Cardinality Estimation
© 2017 InfluxData. All rights reserved.129
HyperLogLog++
Gianluca Arbezzano
gianluca@influxdb.com
@gianarb
Thank you.

Inside the InfluxDB storage engine

  • 1.
    © 2017 InfluxData.All rights reserved.1 Inside the InfluxDB Storage Engine Gianluca Arbezzano gianluca@influxdb.com @gianarb
  • 2.
    © 2017 InfluxData.All rights reserved.2
  • 3.
    © 2017 InfluxData.All rights reserved.3 What is time series data?
  • 4.
    © 2017 InfluxData.All rights reserved.4 Stock trades and quotes
  • 5.
    © 2017 InfluxData.All rights reserved.5 Metrics
  • 6.
    © 2017 InfluxData.All rights reserved.6 Analytics
  • 7.
    © 2017 InfluxData.All rights reserved.7 Events
  • 8.
    © 2017 InfluxData.All rights reserved.8 Sensor data
  • 9.
  • 10.
    © 2017 InfluxData.All rights reserved.10 Two kinds of time series data…
  • 11.
    © 2017 InfluxData.All rights reserved.11 Regular time series t0 t1 t2 t3 t4 t6 t7 Samples at regular intervals
  • 12.
    © 2017 InfluxData.All rights reserved.12 Irregular time series t0 t1 t2 t3 t4 t6 t7 Events whenever they come in
  • 13.
    © 2017 InfluxData.All rights reserved.13 Why would you want a database for time series data?
  • 14.
    © 2017 InfluxData.All rights reserved.14 Scale
  • 15.
    © 2017 InfluxData.All rights reserved.15 Example from server monitoring • 2,000 servers, VMs, containers, or sensor units • 1,000 measurements per server/unit • every 10 seconds • = 17,280,000,000 distinct points per day
  • 16.
    © 2017 InfluxData.All rights reserved.16 Compression
  • 17.
    © 2017 InfluxData.All rights reserved.17 Aging out data
  • 18.
    © 2017 InfluxData.All rights reserved.18 Downsampling
  • 19.
    © 2017 InfluxData.All rights reserved.19 Fast range queries
  • 20.
  • 21.
    © 2017 InfluxData.All rights reserved.21 TSDB
  • 22.
    © 2017 InfluxData.All rights reserved.22 Inverted Index
  • 23.
  • 24.
    © 2017 InfluxData.All rights reserved.24 Everything is indexed by time and series
  • 25.
    © 2017 InfluxData.All rights reserved.25 Shards 10/11/2015 10/12/2015 Data organized into Shards of time, each is an underlying DB efficient to drop old data 10/13/201510/10/2015
  • 26.
    © 2017 InfluxData.All rights reserved.26 InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  • 27.
    © 2017 InfluxData.All rights reserved.27 InfluxDB data Measurement temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  • 28.
    © 2017 InfluxData.All rights reserved.28 InfluxDB data Measurement Tags temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  • 29.
    © 2017 InfluxData.All rights reserved.29 InfluxDB data Measurement Tags Fields temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  • 30.
    © 2017 InfluxData.All rights reserved.30 InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags (tagset all together) Fields Timestamp
  • 31.
    © 2017 InfluxData.All rights reserved.31 InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Fields Timestamp We actually store up to ns scale timestamps but I couldn’t fit on the slide Tags (tagset all together)
  • 32.
    © 2017 InfluxData.All rights reserved.32 Each series and field to a unique ID temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external 1 2
  • 33.
    © 2017 InfluxData.All rights reserved.33 Data per ID is tuples ordered by time temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external 1 2 1 (1443782126,80) 2 (1443782126,18)
  • 34.
    © 2017 InfluxData.All rights reserved.34 Arranging in Key/Value Stores 1,1443782126 Key Value 80 ID Time
  • 35.
    © 2017 InfluxData.All rights reserved.35 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18
  • 36.
    © 2017 InfluxData.All rights reserved.36 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18 1,1443782127 81 new data
  • 37.
    © 2017 InfluxData.All rights reserved.37 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18 1,1443782127 81 key space is ordered
  • 38.
    © 2017 InfluxData.All rights reserved.38 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18 1,1443782127 81 2,1443782256 15 2,1443782130 17 3,1443700126 18
  • 39.
  • 40.
    © 2017 InfluxData.All rights reserved.40 New Storage Engine?!
  • 41.
    © 2017 InfluxData.All rights reserved.41 First we used LSM Trees
  • 42.
    © 2017 InfluxData.All rights reserved.42 deletes expensive
  • 43.
    © 2017 InfluxData.All rights reserved.43 too many open file handles
  • 44.
    © 2017 InfluxData.All rights reserved.44 Then mmap COW B+Trees
  • 45.
    © 2017 InfluxData.All rights reserved.45 write throughput
  • 46.
    © 2017 InfluxData.All rights reserved.46 compression
  • 47.
    © 2017 InfluxData.All rights reserved.47 met our requirements
  • 48.
    © 2017 InfluxData.All rights reserved.48 High write throughput
  • 49.
    © 2017 InfluxData.All rights reserved.49 Awesome read performance
  • 50.
    © 2017 InfluxData.All rights reserved.50 Better Compression
  • 51.
    © 2017 InfluxData.All rights reserved.51 Writes can’t block reads
  • 52.
    © 2017 InfluxData.All rights reserved.52 Reads can’t block writes
  • 53.
    © 2017 InfluxData.All rights reserved.53 Write multiple ranges simultaneously
  • 54.
  • 55.
    © 2017 InfluxData.All rights reserved.55 Many databases open in a single process
  • 56.
    © 2017 InfluxData.All rights reserved.56 Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)
  • 57.
    © 2017 InfluxData.All rights reserved.57 Enter InfluxDB’s Time Structured Merge Tree (TSM Tree) like LSM, but different
  • 58.
    © 2017 InfluxData.All rights reserved.58 Components WAL In memory cache Index Files
  • 59.
    © 2017 InfluxData.All rights reserved.59 Components WAL In memory cache Index Files Similar to LSM Trees
  • 60.
    © 2017 InfluxData.All rights reserved.60 Components WAL In memory cache Index Files Similar to LSM Trees Same
  • 61.
    © 2017 InfluxData.All rights reserved.61 Components WAL In memory cache Index Files Similar to LSM Trees Same like MemTables
  • 62.
    © 2017 InfluxData.All rights reserved.62 Components WAL In memory cache Index Files Similar to LSM Trees Same like MemTables like SSTables
  • 63.
    © 2017 InfluxData.All rights reserved.63 awesome time series data WAL (an append only file)
  • 64.
    © 2017 InfluxData.All rights reserved.64 awesome time series data WAL (an append only file) in memory index
  • 65.
    © 2017 InfluxData.All rights reserved.65 awesome time series data WAL (an append only file) in memory index on disk index (periodic flushes)
  • 66.
    © 2017 InfluxData.All rights reserved.66 awesome time series data WAL (an append only file) in memory index on disk index (periodic flushes) Memory mapped!
  • 67.
    © 2017 InfluxData.All rights reserved.67 TSM File
  • 68.
    © 2017 InfluxData.All rights reserved.68 TSM File
  • 69.
    © 2017 InfluxData.All rights reserved.69 TSM File
  • 70.
    © 2017 InfluxData.All rights reserved.70 TSM File
  • 71.
    © 2017 InfluxData.All rights reserved.71 TSM File
  • 72.
    © 2017 InfluxData.All rights reserved.72 TSM File
  • 73.
    © 2017 InfluxData.All rights reserved.73 Compression
  • 74.
    © 2017 InfluxData.All rights reserved.74 Timestamps: encoding based on precision and deltas
  • 75.
    © 2017 InfluxData.All rights reserved.75 Timestamps (best case): Run length encoding Deltas are all the same for a block
  • 76.
    © 2017 InfluxData.All rights reserved.76 Timestamps (good case): Simple8B Ann and Moffat in "Index compression using 64-bit words"
  • 77.
    © 2017 InfluxData.All rights reserved.77 Timestamps (worst case): raw values nano-second timestamps with large deltas
  • 78.
    © 2017 InfluxData.All rights reserved.78 float64: double delta Facebook’s Gorilla - google: gorilla time series facebook https://github.com/dgryski/go-tsz
  • 79.
    © 2017 InfluxData.All rights reserved.79 booleans are bits!
  • 80.
    © 2017 InfluxData.All rights reserved.80 int64 uses double delta, zig-zag zig-zag same as from Protobufs
  • 81.
    © 2017 InfluxData.All rights reserved.81 string uses Snappy same compression LevelDB uses (might add dictionary compression)
  • 82.
    © 2017 InfluxData.All rights reserved.82 Updates Write, resolve at query
  • 83.
    © 2017 InfluxData.All rights reserved.83 Deletes tombstone, resolve at query & compaction
  • 84.
    © 2017 InfluxData.All rights reserved.84 Compactions • Combine multiple TSM files • Put all series points into same file • Series points in 1k blocks • Multiple levels • Full compaction when cold for writes
  • 85.
    © 2017 InfluxData.All rights reserved.85 Example Query select percentile(90, value) from cpu where time > now() - 12h and “region” = ‘west’ group by time(10m), host
  • 86.
    © 2017 InfluxData.All rights reserved.86 Example Query select percentile(90, value) from cpu where time > now() - 12h and “region” = ‘west’ group by time(10m), host How to map to series?
  • 87.
    © 2017 InfluxData.All rights reserved.87 Inverted Index!
  • 88.
    © 2017 InfluxData.All rights reserved.88 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 series to ID
  • 89.
    © 2017 InfluxData.All rights reserved.89 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] measurement to fields series to ID
  • 90.
    © 2017 InfluxData.All rights reserved.90 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] host -> [A, B] measurement to fields host to values series to ID
  • 91.
    © 2017 InfluxData.All rights reserved.91 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] host -> [A, B] region -> [west] measurement to fields host to values region to values series to ID
  • 92.
    © 2017 InfluxData.All rights reserved.92 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] host -> [A, B] region -> [west] cpu -> [1, 2] host=A -> [1] host=B -> [1] region=west -> [1, 2] measurement to fields host to values region to values series to ID postings lists
  • 93.
    © 2017 InfluxData.All rights reserved.93 Index V1 • In-memory • Load on boot • Memory constrained • Slower boot times with high cardinality
  • 94.
    © 2017 InfluxData.All rights reserved.94 Index V2
  • 95.
    © 2017 InfluxData.All rights reserved.95 in memory index on disk index (do we already have?) time series meta data
  • 96.
    © 2017 InfluxData.All rights reserved.96 in memory index on disk index (do we already have?) time series meta data nope WAL (an append only file)
  • 97.
    © 2017 InfluxData.All rights reserved.97 in memory index on disk index (do we already have?) time series meta data nope WAL (an append only file) on disk indices (periodic flushes)
  • 98.
    © 2017 InfluxData.All rights reserved.98 in memory index on disk index (do we already have?) time series meta data nope WAL (an append only file) on disk indices (periodic flushes) (compactions) on disk index
  • 99.
    © 2017 InfluxData.All rights reserved.99
  • 100.
    © 2017 InfluxData.All rights reserved.100 Index File Layout
  • 101.
    © 2017 InfluxData.All rights reserved.101
  • 102.
    © 2017 InfluxData.All rights reserved.102
  • 103.
    © 2017 InfluxData.All rights reserved.103 Example Key Exists Lookup [ 76, 234, 129, 352 ] File locations
  • 104.
    © 2017 InfluxData.All rights reserved.104 [ 76, 234, 129, 352 ] cpu,host=serverA,region=west#idle
  • 105.
    © 2017 InfluxData.All rights reserved.105 [ 76, 234, 129, 352 ] cpu,host=serverA,region=west#idle
  • 106.
    © 2017 InfluxData.All rights reserved.106 Robin Hood Hashing • Can fully load table • No linked lists for lookup • Perfect for read-only hashes
  • 107.
    © 2017 InfluxData.All rights reserved.107 [ , , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths
  • 108.
    © 2017 InfluxData.All rights reserved.108 [ , , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths A -> 0
  • 109.
    © 2017 InfluxData.All rights reserved.109 [ A, , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths A -> 0
  • 110.
    © 2017 InfluxData.All rights reserved.110 [ A, , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths B -> 1
  • 111.
    © 2017 InfluxData.All rights reserved.111 [ A, B, , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths B -> 1
  • 112.
    © 2017 InfluxData.All rights reserved.112 [ A, B, , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths C -> 1
  • 113.
    © 2017 InfluxData.All rights reserved.113 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths C -> 2
  • 114.
    © 2017 InfluxData.All rights reserved.114 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 1, 0, 0 ] Keys Probe Lengths C -> probe 1
  • 115.
    © 2017 InfluxData.All rights reserved.115 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 1, 0, 0 ] Keys Probe Lengths D -> 0
  • 116.
    © 2017 InfluxData.All rights reserved.116 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 1, 0, 0 ] Keys Probe Lengths D -> probe 1
  • 117.
    © 2017 InfluxData.All rights reserved.117 [ A, D, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 0, 0 ] Keys Probe Lengths B -> probe 1
  • 118.
    © 2017 InfluxData.All rights reserved.118 [ A, D, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 0, 0 ] Keys Probe Lengths B -> probe 2
  • 119.
    © 2017 InfluxData.All rights reserved.119 [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths B -> probe 2
  • 120.
    © 2017 InfluxData.All rights reserved.120 Rob probe rich, give to probe poor
  • 121.
    © 2017 InfluxData.All rights reserved.121 Refinement: average probe
  • 122.
    © 2017 InfluxData.All rights reserved.122 Cache Hit [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe LengthsAverage: 1
  • 123.
    © 2017 InfluxData.All rights reserved.123 Cache Hit [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe LengthsAverage: 1 D -> hashes to 0 + 1
  • 124.
    © 2017 InfluxData.All rights reserved.124 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Z -> hashes to 0
  • 125.
    © 2017 InfluxData.All rights reserved.125 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Z -> move probe 1
  • 126.
    © 2017 InfluxData.All rights reserved.126 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Z -> move probe 2
  • 127.
    © 2017 InfluxData.All rights reserved.127 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Max Probe 2, so Z not present
  • 128.
    © 2017 InfluxData.All rights reserved.128 Cardinality Estimation
  • 129.
    © 2017 InfluxData.All rights reserved.129 HyperLogLog++
  • 130.