Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
×
1 of 130

Inside the InfluxDB storage engine

14

Share

Download to read offline

In this presentation, I take a deep dive into the InfluxDB open source storage engine. More than just a single storage engine, InfluxDB is two engines in one: the first for time series data and the second, an index for metadata. I'll delve into the optimizations for achieving high write throughput, compression and fast reads for both the raw time series data and the metadata.

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Inside the InfluxDB storage engine

  1. 1. © 2017 InfluxData. All rights reserved.1 Inside the InfluxDB Storage Engine Gianluca Arbezzano gianluca@influxdb.com @gianarb
  2. 2. © 2017 InfluxData. All rights reserved.2
  3. 3. © 2017 InfluxData. All rights reserved.3 What is time series data?
  4. 4. © 2017 InfluxData. All rights reserved.4 Stock trades and quotes
  5. 5. © 2017 InfluxData. All rights reserved.5 Metrics
  6. 6. © 2017 InfluxData. All rights reserved.6 Analytics
  7. 7. © 2017 InfluxData. All rights reserved.7 Events
  8. 8. © 2017 InfluxData. All rights reserved.8 Sensor data
  9. 9. Traces
  10. 10. © 2017 InfluxData. All rights reserved.10 Two kinds of time series data…
  11. 11. © 2017 InfluxData. All rights reserved.11 Regular time series t0 t1 t2 t3 t4 t6 t7 Samples at regular intervals
  12. 12. © 2017 InfluxData. All rights reserved.12 Irregular time series t0 t1 t2 t3 t4 t6 t7 Events whenever they come in
  13. 13. © 2017 InfluxData. All rights reserved.13 Why would you want a database for time series data?
  14. 14. © 2017 InfluxData. All rights reserved.14 Scale
  15. 15. © 2017 InfluxData. All rights reserved.15 Example from server monitoring • 2,000 servers, VMs, containers, or sensor units • 1,000 measurements per server/unit • every 10 seconds • = 17,280,000,000 distinct points per day
  16. 16. © 2017 InfluxData. All rights reserved.16 Compression
  17. 17. © 2017 InfluxData. All rights reserved.17 Aging out data
  18. 18. © 2017 InfluxData. All rights reserved.18 Downsampling
  19. 19. © 2017 InfluxData. All rights reserved.19 Fast range queries
  20. 20. Two Databases…
  21. 21. © 2017 InfluxData. All rights reserved.21 TSDB
  22. 22. © 2017 InfluxData. All rights reserved.22 Inverted Index
  23. 23. preliminary intro materials…
  24. 24. © 2017 InfluxData. All rights reserved.24 Everything is indexed by time and series
  25. 25. © 2017 InfluxData. All rights reserved.25 Shards 10/11/2015 10/12/2015 Data organized into Shards of time, each is an underlying DB efficient to drop old data 10/13/201510/10/2015
  26. 26. © 2017 InfluxData. All rights reserved.26 InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  27. 27. © 2017 InfluxData. All rights reserved.27 InfluxDB data Measurement temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  28. 28. © 2017 InfluxData. All rights reserved.28 InfluxDB data Measurement Tags temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  29. 29. © 2017 InfluxData. All rights reserved.29 InfluxDB data Measurement Tags Fields temperature,device=dev1,building=b1 internal=80,external=18 1443782126
  30. 30. © 2017 InfluxData. All rights reserved.30 InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Tags (tagset all together) Fields Timestamp
  31. 31. © 2017 InfluxData. All rights reserved.31 InfluxDB data temperature,device=dev1,building=b1 internal=80,external=18 1443782126 Measurement Fields Timestamp We actually store up to ns scale timestamps but I couldn’t fit on the slide Tags (tagset all together)
  32. 32. © 2017 InfluxData. All rights reserved.32 Each series and field to a unique ID temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external 1 2
  33. 33. © 2017 InfluxData. All rights reserved.33 Data per ID is tuples ordered by time temperature,device=dev1,building=b1#internal temperature,device=dev1,building=b1#external 1 2 1 (1443782126,80) 2 (1443782126,18)
  34. 34. © 2017 InfluxData. All rights reserved.34 Arranging in Key/Value Stores 1,1443782126 Key Value 80 ID Time
  35. 35. © 2017 InfluxData. All rights reserved.35 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18
  36. 36. © 2017 InfluxData. All rights reserved.36 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18 1,1443782127 81 new data
  37. 37. © 2017 InfluxData. All rights reserved.37 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18 1,1443782127 81 key space is ordered
  38. 38. © 2017 InfluxData. All rights reserved.38 Arranging in Key/Value Stores 1,1443782126 Key Value 80 2,1443782126 18 1,1443782127 81 2,1443782256 15 2,1443782130 17 3,1443700126 18
  39. 39. Many existing storage engines have this model
  40. 40. © 2017 InfluxData. All rights reserved.40 New Storage Engine?!
  41. 41. © 2017 InfluxData. All rights reserved.41 First we used LSM Trees
  42. 42. © 2017 InfluxData. All rights reserved.42 deletes expensive
  43. 43. © 2017 InfluxData. All rights reserved.43 too many open file handles
  44. 44. © 2017 InfluxData. All rights reserved.44 Then mmap COW B+Trees
  45. 45. © 2017 InfluxData. All rights reserved.45 write throughput
  46. 46. © 2017 InfluxData. All rights reserved.46 compression
  47. 47. © 2017 InfluxData. All rights reserved.47 met our requirements
  48. 48. © 2017 InfluxData. All rights reserved.48 High write throughput
  49. 49. © 2017 InfluxData. All rights reserved.49 Awesome read performance
  50. 50. © 2017 InfluxData. All rights reserved.50 Better Compression
  51. 51. © 2017 InfluxData. All rights reserved.51 Writes can’t block reads
  52. 52. © 2017 InfluxData. All rights reserved.52 Reads can’t block writes
  53. 53. © 2017 InfluxData. All rights reserved.53 Write multiple ranges simultaneously
  54. 54. Hot backups
  55. 55. © 2017 InfluxData. All rights reserved.55 Many databases open in a single process
  56. 56. © 2017 InfluxData. All rights reserved.56 Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)
  57. 57. © 2017 InfluxData. All rights reserved.57 Enter InfluxDB’s Time Structured Merge Tree (TSM Tree) like LSM, but different
  58. 58. © 2017 InfluxData. All rights reserved.58 Components WAL In memory cache Index Files
  59. 59. © 2017 InfluxData. All rights reserved.59 Components WAL In memory cache Index Files Similar to LSM Trees
  60. 60. © 2017 InfluxData. All rights reserved.60 Components WAL In memory cache Index Files Similar to LSM Trees Same
  61. 61. © 2017 InfluxData. All rights reserved.61 Components WAL In memory cache Index Files Similar to LSM Trees Same like MemTables
  62. 62. © 2017 InfluxData. All rights reserved.62 Components WAL In memory cache Index Files Similar to LSM Trees Same like MemTables like SSTables
  63. 63. © 2017 InfluxData. All rights reserved.63 awesome time series data WAL (an append only file)
  64. 64. © 2017 InfluxData. All rights reserved.64 awesome time series data WAL (an append only file) in memory index
  65. 65. © 2017 InfluxData. All rights reserved.65 awesome time series data WAL (an append only file) in memory index on disk index (periodic flushes)
  66. 66. © 2017 InfluxData. All rights reserved.66 awesome time series data WAL (an append only file) in memory index on disk index (periodic flushes) Memory mapped!
  67. 67. © 2017 InfluxData. All rights reserved.67 TSM File
  68. 68. © 2017 InfluxData. All rights reserved.68 TSM File
  69. 69. © 2017 InfluxData. All rights reserved.69 TSM File
  70. 70. © 2017 InfluxData. All rights reserved.70 TSM File
  71. 71. © 2017 InfluxData. All rights reserved.71 TSM File
  72. 72. © 2017 InfluxData. All rights reserved.72 TSM File
  73. 73. © 2017 InfluxData. All rights reserved.73 Compression
  74. 74. © 2017 InfluxData. All rights reserved.74 Timestamps: encoding based on precision and deltas
  75. 75. © 2017 InfluxData. All rights reserved.75 Timestamps (best case): Run length encoding Deltas are all the same for a block
  76. 76. © 2017 InfluxData. All rights reserved.76 Timestamps (good case): Simple8B Ann and Moffat in "Index compression using 64-bit words"
  77. 77. © 2017 InfluxData. All rights reserved.77 Timestamps (worst case): raw values nano-second timestamps with large deltas
  78. 78. © 2017 InfluxData. All rights reserved.78 float64: double delta Facebook’s Gorilla - google: gorilla time series facebook https://github.com/dgryski/go-tsz
  79. 79. © 2017 InfluxData. All rights reserved.79 booleans are bits!
  80. 80. © 2017 InfluxData. All rights reserved.80 int64 uses double delta, zig-zag zig-zag same as from Protobufs
  81. 81. © 2017 InfluxData. All rights reserved.81 string uses Snappy same compression LevelDB uses (might add dictionary compression)
  82. 82. © 2017 InfluxData. All rights reserved.82 Updates Write, resolve at query
  83. 83. © 2017 InfluxData. All rights reserved.83 Deletes tombstone, resolve at query & compaction
  84. 84. © 2017 InfluxData. All rights reserved.84 Compactions • Combine multiple TSM files • Put all series points into same file • Series points in 1k blocks • Multiple levels • Full compaction when cold for writes
  85. 85. © 2017 InfluxData. All rights reserved.85 Example Query select percentile(90, value) from cpu where time > now() - 12h and “region” = ‘west’ group by time(10m), host
  86. 86. © 2017 InfluxData. All rights reserved.86 Example Query select percentile(90, value) from cpu where time > now() - 12h and “region” = ‘west’ group by time(10m), host How to map to series?
  87. 87. © 2017 InfluxData. All rights reserved.87 Inverted Index!
  88. 88. © 2017 InfluxData. All rights reserved.88 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 series to ID
  89. 89. © 2017 InfluxData. All rights reserved.89 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] measurement to fields series to ID
  90. 90. © 2017 InfluxData. All rights reserved.90 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] host -> [A, B] measurement to fields host to values series to ID
  91. 91. © 2017 InfluxData. All rights reserved.91 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] host -> [A, B] region -> [west] measurement to fields host to values region to values series to ID
  92. 92. © 2017 InfluxData. All rights reserved.92 Inverted Index cpu,host=A,region=west#idle -> 1 cpu,host=B,region=west#idle -> 2 cpu -> [idle] host -> [A, B] region -> [west] cpu -> [1, 2] host=A -> [1] host=B -> [1] region=west -> [1, 2] measurement to fields host to values region to values series to ID postings lists
  93. 93. © 2017 InfluxData. All rights reserved.93 Index V1 • In-memory • Load on boot • Memory constrained • Slower boot times with high cardinality
  94. 94. © 2017 InfluxData. All rights reserved.94 Index V2
  95. 95. © 2017 InfluxData. All rights reserved.95 in memory index on disk index (do we already have?) time series meta data
  96. 96. © 2017 InfluxData. All rights reserved.96 in memory index on disk index (do we already have?) time series meta data nope WAL (an append only file)
  97. 97. © 2017 InfluxData. All rights reserved.97 in memory index on disk index (do we already have?) time series meta data nope WAL (an append only file) on disk indices (periodic flushes)
  98. 98. © 2017 InfluxData. All rights reserved.98 in memory index on disk index (do we already have?) time series meta data nope WAL (an append only file) on disk indices (periodic flushes) (compactions) on disk index
  99. 99. © 2017 InfluxData. All rights reserved.99
  100. 100. © 2017 InfluxData. All rights reserved.100 Index File Layout
  101. 101. © 2017 InfluxData. All rights reserved.101
  102. 102. © 2017 InfluxData. All rights reserved.102
  103. 103. © 2017 InfluxData. All rights reserved.103 Example Key Exists Lookup [ 76, 234, 129, 352 ] File locations
  104. 104. © 2017 InfluxData. All rights reserved.104 [ 76, 234, 129, 352 ] cpu,host=serverA,region=west#idle
  105. 105. © 2017 InfluxData. All rights reserved.105 [ 76, 234, 129, 352 ] cpu,host=serverA,region=west#idle
  106. 106. © 2017 InfluxData. All rights reserved.106 Robin Hood Hashing • Can fully load table • No linked lists for lookup • Perfect for read-only hashes
  107. 107. © 2017 InfluxData. All rights reserved.107 [ , , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths
  108. 108. © 2017 InfluxData. All rights reserved.108 [ , , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths A -> 0
  109. 109. © 2017 InfluxData. All rights reserved.109 [ A, , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths A -> 0
  110. 110. © 2017 InfluxData. All rights reserved.110 [ A, , , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths B -> 1
  111. 111. © 2017 InfluxData. All rights reserved.111 [ A, B, , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths B -> 1
  112. 112. © 2017 InfluxData. All rights reserved.112 [ A, B, , , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths C -> 1
  113. 113. © 2017 InfluxData. All rights reserved.113 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 0, 0, 0 ] Keys Probe Lengths C -> 2
  114. 114. © 2017 InfluxData. All rights reserved.114 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 1, 0, 0 ] Keys Probe Lengths C -> probe 1
  115. 115. © 2017 InfluxData. All rights reserved.115 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 1, 0, 0 ] Keys Probe Lengths D -> 0
  116. 116. © 2017 InfluxData. All rights reserved.116 [ A, B, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 0, 1, 0, 0 ] Keys Probe Lengths D -> probe 1
  117. 117. © 2017 InfluxData. All rights reserved.117 [ A, D, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 0, 0 ] Keys Probe Lengths B -> probe 1
  118. 118. © 2017 InfluxData. All rights reserved.118 [ A, D, C, , ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 0, 0 ] Keys Probe Lengths B -> probe 2
  119. 119. © 2017 InfluxData. All rights reserved.119 [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths B -> probe 2
  120. 120. © 2017 InfluxData. All rights reserved.120 Rob probe rich, give to probe poor
  121. 121. © 2017 InfluxData. All rights reserved.121 Refinement: average probe
  122. 122. © 2017 InfluxData. All rights reserved.122 Cache Hit [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe LengthsAverage: 1
  123. 123. © 2017 InfluxData. All rights reserved.123 Cache Hit [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe LengthsAverage: 1 D -> hashes to 0 + 1
  124. 124. © 2017 InfluxData. All rights reserved.124 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Z -> hashes to 0
  125. 125. © 2017 InfluxData. All rights reserved.125 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Z -> move probe 1
  126. 126. © 2017 InfluxData. All rights reserved.126 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Z -> move probe 2
  127. 127. © 2017 InfluxData. All rights reserved.127 Cache Miss [ A, D, C, B, ] [ 0, 1, 2, 3, 4 ] Positions [ 0, 1, 1, 2, 0 ] Keys Probe Lengths Max Probe 2, so Z not present
  128. 128. © 2017 InfluxData. All rights reserved.128 Cardinality Estimation
  129. 129. © 2017 InfluxData. All rights reserved.129 HyperLogLog++
  130. 130. Gianluca Arbezzano gianluca@influxdb.com @gianarb Thank you.

×