Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

of

Thanos: Global, durable Prometheus monitoring Slide 1 Thanos: Global, durable Prometheus monitoring Slide 2 Thanos: Global, durable Prometheus monitoring Slide 3 Thanos: Global, durable Prometheus monitoring Slide 4 Thanos: Global, durable Prometheus monitoring Slide 5 Thanos: Global, durable Prometheus monitoring Slide 6 Thanos: Global, durable Prometheus monitoring Slide 7 Thanos: Global, durable Prometheus monitoring Slide 8 Thanos: Global, durable Prometheus monitoring Slide 9 Thanos: Global, durable Prometheus monitoring Slide 10 Thanos: Global, durable Prometheus monitoring Slide 11 Thanos: Global, durable Prometheus monitoring Slide 12 Thanos: Global, durable Prometheus monitoring Slide 13 Thanos: Global, durable Prometheus monitoring Slide 14 Thanos: Global, durable Prometheus monitoring Slide 15 Thanos: Global, durable Prometheus monitoring Slide 16 Thanos: Global, durable Prometheus monitoring Slide 17 Thanos: Global, durable Prometheus monitoring Slide 18 Thanos: Global, durable Prometheus monitoring Slide 19 Thanos: Global, durable Prometheus monitoring Slide 20 Thanos: Global, durable Prometheus monitoring Slide 21 Thanos: Global, durable Prometheus monitoring Slide 22 Thanos: Global, durable Prometheus monitoring Slide 23 Thanos: Global, durable Prometheus monitoring Slide 24 Thanos: Global, durable Prometheus monitoring Slide 25 Thanos: Global, durable Prometheus monitoring Slide 26 Thanos: Global, durable Prometheus monitoring Slide 27 Thanos: Global, durable Prometheus monitoring Slide 28 Thanos: Global, durable Prometheus monitoring Slide 29 Thanos: Global, durable Prometheus monitoring Slide 30 Thanos: Global, durable Prometheus monitoring Slide 31 Thanos: Global, durable Prometheus monitoring Slide 32 Thanos: Global, durable Prometheus monitoring Slide 33 Thanos: Global, durable Prometheus monitoring Slide 34 Thanos: Global, durable Prometheus monitoring Slide 35 Thanos: Global, durable Prometheus monitoring Slide 36 Thanos: Global, durable Prometheus monitoring Slide 37 Thanos: Global, durable Prometheus monitoring Slide 38 Thanos: Global, durable Prometheus monitoring Slide 39 Thanos: Global, durable Prometheus monitoring Slide 40 Thanos: Global, durable Prometheus monitoring Slide 41 Thanos: Global, durable Prometheus monitoring Slide 42 Thanos: Global, durable Prometheus monitoring Slide 43 Thanos: Global, durable Prometheus monitoring Slide 44 Thanos: Global, durable Prometheus monitoring Slide 45 Thanos: Global, durable Prometheus monitoring Slide 46 Thanos: Global, durable Prometheus monitoring Slide 47 Thanos: Global, durable Prometheus monitoring Slide 48 Thanos: Global, durable Prometheus monitoring Slide 49 Thanos: Global, durable Prometheus monitoring Slide 50 Thanos: Global, durable Prometheus monitoring Slide 51 Thanos: Global, durable Prometheus monitoring Slide 52 Thanos: Global, durable Prometheus monitoring Slide 53 Thanos: Global, durable Prometheus monitoring Slide 54 Thanos: Global, durable Prometheus monitoring Slide 55 Thanos: Global, durable Prometheus monitoring Slide 56 Thanos: Global, durable Prometheus monitoring Slide 57 Thanos: Global, durable Prometheus monitoring Slide 58 Thanos: Global, durable Prometheus monitoring Slide 59 Thanos: Global, durable Prometheus monitoring Slide 60 Thanos: Global, durable Prometheus monitoring Slide 61 Thanos: Global, durable Prometheus monitoring Slide 62 Thanos: Global, durable Prometheus monitoring Slide 63 Thanos: Global, durable Prometheus monitoring Slide 64 Thanos: Global, durable Prometheus monitoring Slide 65 Thanos: Global, durable Prometheus monitoring Slide 66 Thanos: Global, durable Prometheus monitoring Slide 67 Thanos: Global, durable Prometheus monitoring Slide 68 Thanos: Global, durable Prometheus monitoring Slide 69 Thanos: Global, durable Prometheus monitoring Slide 70 Thanos: Global, durable Prometheus monitoring Slide 71 Thanos: Global, durable Prometheus monitoring Slide 72 Thanos: Global, durable Prometheus monitoring Slide 73 Thanos: Global, durable Prometheus monitoring Slide 74 Thanos: Global, durable Prometheus monitoring Slide 75 Thanos: Global, durable Prometheus monitoring Slide 76 Thanos: Global, durable Prometheus monitoring Slide 77 Thanos: Global, durable Prometheus monitoring Slide 78 Thanos: Global, durable Prometheus monitoring Slide 79 Thanos: Global, durable Prometheus monitoring Slide 80 Thanos: Global, durable Prometheus monitoring Slide 81 Thanos: Global, durable Prometheus monitoring Slide 82 Thanos: Global, durable Prometheus monitoring Slide 83 Thanos: Global, durable Prometheus monitoring Slide 84 Thanos: Global, durable Prometheus monitoring Slide 85 Thanos: Global, durable Prometheus monitoring Slide 86 Thanos: Global, durable Prometheus monitoring Slide 87
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

35 Likes

Share

Download to read offline

Thanos: Global, durable Prometheus monitoring

Download to read offline

Prometheus’s simple and reliable operational model is one of its major selling points. However, after surpassing a certain scale, we have identified a few shortcomings it imposes. We are proud to present Thanos, an open source project by Improbable that bundles a set of components that seamlessly transform existing Prometheus deployments, into a unified, global scale monitoring system.

Authors: Fabian Reinartz, Bartlomiej Plotka

Slides from January London Prometheus Meetup 2018.
Thanos: https://github.com/improbable-eng/thanos

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Thanos: Global, durable Prometheus monitoring

  1. 1. Bartek Plotka Bwplotka Bplotka Fabian Reinartz fabxc Global, durable Prometheus monitoring
  2. 2. Prometheus 2.0 ● Reliable operational model ● Powerful query language ● Scraping capabilities beyond the casual usage ● Local metric storage
  3. 3. Prometheus at Scale A dream
  4. 4. n1 Improbable case 2 n+1 ... ● Multiple isolated Kubernetes clusters
  5. 5. n1 Improbable case 2 Prometheus n+1 Prometheus ... ● Multiple isolated Kubernetes clusters ● Single Prometheus server per cluster
  6. 6. 1 Improbable case 2 Prometheus n n+1 Prometheus ... ● Multiple isolated Kubernetes clusters ● Single Prometheus server per cluster ● Dashboards & Alertmanager in separate cluster Grafana Alertmanager
  7. 7. Improbable case ● Multiple isolated Kubernetes clusters ● Single Prometheus server per cluster ● Dashboards & Alertmanager in separate cluster Grafana Alertmanager What is missing? 1 2 Prometheus n n+1 Prometheus ...
  8. 8. Global View See everything from a single place!
  9. 9. 1 Global View 2 Prometheus n n+1 Prometheus ... Grafana Alertmanager “Alert when 66% of clusters in a region are down”
  10. 10. 1 Global View 2 Prometheus n n+1 Prometheus ... ● How to aggregate data from different clusters? Grafana Alertmanager
  11. 11. 1 Global View 2 Prometheus n n+1 Prometheus ... ● How to aggregate data from different clusters? ○ Use hierarchical federation? Grafana Alertmanager Prometheus
  12. 12. 1 Global View 2 Prometheus n n+1 Prometheus ... ● How to aggregate data from different clusters? ○ Use hierarchical federation? Grafana AlertmanagerSingle point of failure Maintenance What data are federated? Prometheus
  13. 13. Availability Where is my sample?
  14. 14. 1 Availability 2 Prometheus n n+1 Prometheus ... Grafana Alertmanager Operator error Hardware failure Rollout
  15. 15. 1 Availability 2 Prometheus n n+1 Prometheus ... ● Can we assure no loss in metric data? Grafana Alertmanager
  16. 16. 1 Availability 2 Prometheus n n+1 Prometheus ... ● Can we assure no loss in metric data? ○ Add HA replicas? Grafana Alertmanager Prometheus Prometheus
  17. 17. 1 Availability 2 Prometheus n n+1 Prometheus ... ● Can we assure no loss in metric data? ○ Add HA replicas? Grafana Alertmanager Prometheus Prometheus What replica we should query? Cost? Where to put rules and alerts?
  18. 18. Historical Metrics What exactly happened X weeks ago?
  19. 19. Metric retention T - 3 years TT - 12 monthsT - 2 years ? “Go back to what happened 6 months ago...”
  20. 20. Metric retention T - 3 years TT - 12 monthsT - 2 years ? “Good progress! Memory allocs looks better than 1,5 year ago!” T X
  21. 21. Metric retention “Let’s see user traffic across years!” T - 2 years T - 12 monthsT - 3 years ? ? ? ? T X
  22. 22. Metric retention Infrastructure retention: 9 days 9 days T - 3 years TT - 12 monthsT - 2 years T
  23. 23. Metric retention ● Can we have longer retention?
  24. 24. Metric retention ● Can we have longer retention? ○ Upgrade to Prometheus 2.0
  25. 25. Metric retention ● Can we have longer retention? ○ Upgrade to Prometheus 2.0 ○ Scale SSD Vertically? SSD Prometheus
  26. 26. Metric retention ● Can we have longer retention? ○ Upgrade to Prometheus 2.0 ○ Scale SSD Vertically? SSD Prometheus
  27. 27. Metric retention ● Can we have longer retention? ○ Upgrade to Prometheus 2.0 ○ Scale SSD Vertically? SSD Prometheus
  28. 28. Metric retention ● Can we have longer retention? ○ Upgrade to Prometheus 2.0 ○ Scale SSD Vertically? SSD Prometheus SSD SSD SSD SSD
  29. 29. Metric retention ● Can we have longer retention? ○ Upgrade to Prometheus 2.0 ○ Scale SSD Vertically? SSD Prometheus Backup? Maintenance Cost
  30. 30. Recap 1 2 Prometheus n n+1 Prometheus ... Grafana Alertmanager Prometheus It is just hard to… ● Have a global view ● Have a HA in place ● Increase retention
  31. 31. Thanos It is just hard to… ● Have a global view ● Have a HA in place ● Increase retention
  32. 32. ● Seamless integration with Prometheus ● Easy deployment model ● Minimal number of dependencies ● Minimal baseline cost Additional Goals
  33. 33. Global View See everything from a single place!
  34. 34. SSD Prometheus Prometheus Targets
  35. 35. SSD Sidecar Prometheus Sidecar Targets
  36. 36. SSD Sidecar Prometheus Sidecar Targets gRPC (Store API)
  37. 37. Store API service Store { rpc Series(SeriesRequest) returns (stream SeriesResponse); rpc LabelNames(LabelNamesRequest) returns (LabelNamesResponse); rpc LabelValues(LabelValuesRequest) returns (LabelValuesResponse); } message SeriesRequest { int64 min_time = 1; int64 max_time = 2; repeated LabelMatcher matchers = 3; } Sidecar Prometheus remote read Store API
  38. 38. SSD Querier Prometheus Sidecar Querier Store API Targets HTTP Query API
  39. 39. SSD Global View Prometheus Sidecar Querier Targets SSD Sidecar Targets Prometheus Merge Store API
  40. 40. SSD Global View + Availability Prometheus Sidecar Targets SSD Sidecar Targets Prometheus SSD Sidecar Prometheus “replica”:”1” “replica”:”2” Querier Merge Deduplicate Store API
  41. 41. Thanos It is just hard to… ● Have a global view ● Have a HA in place ● Increase retention
  42. 42. Historical Metrics What exactly happened X weeks ago?
  43. 43. TSDB Layout Block 2 Block 4Block 3Block 1 T-200T-300 T-100 T-50 T
  44. 44. TSDB Layout Block 4Block 3Block 1 chunks chunks chunks chunks index T-200T-300 T-100 T-50 T
  45. 45. SSD Data saving Prometheus Sidecar Targets Object Storage Blocks Blocks Block
  46. 46. Store Object Storage Blocks Cache Store Querier Store API
  47. 47. Store Object Storage Blocks Cache Store Querier Block Store API
  48. 48. Store ● A series is made up of one or more “chunks” ● A chunk contains ~120 samples each ● Chunks can be retrieved through HTTP byte range queries
  49. 49. Store ● A series is made up of one or more “chunks” ● A chunk contains ~120 samples each ● Chunks can be retrieved through HTTP byte range queries Example: ● 1000 series @ 30s scrape interval
  50. 50. Store ● A series is made up of one or more “Chunks” ● A chunk contains ~120 samples each ● Chunks can be retrieved through HTTP byte range queries Example: ● 1000 series @ 30s scrape interval ● Query 1 year 8.7 million chunks/range queries
  51. 51. Store Leverage Prometheus’ TSDB file layout
  52. 52. Store Leverage Prometheus’ TSDB file layout ● Chunks of the same series are aligned sequentially
  53. 53. Store Leverage Prometheus’ TSDB file layout ● Chunks of the same series are aligned ● Similar series are aligned, e.g. due to same metric name
  54. 54. Store Leverage Prometheus’ TSDB file layout ● Chunks of the same series are aligned ● Similar series are aligned, e.g. due to same metric name Consolidating ranges in close proximity reduces request count by 4-6 orders of magnitude. 8.7 million requests turned into O(20) requests.
  55. 55. Store Leverage Prometheus’ TSDB file layout ● Chunks of the same series are aligned ● Similar series are aligned, e.g. due to same metric name Index lookups profit from a similar approach.
  56. 56. Compaction Density matters
  57. 57. Compaction Object Storage Blocks Disk Compactor
  58. 58. Compaction Object Storage Blocks Disk Compactor Blocks
  59. 59. Compaction Object Storage Blocks Disk Compactor Blocks Block
  60. 60. Compaction Object Storage Blocks Disk Compactor Block
  61. 61. Thanos It is just hard to… ● Have a global view ● Have a HA in place ● Increase retention
  62. 62. Downsampling Let’s just step back a little
  63. 63. Downsampling Raw: 16 bytes/sample Compressed: 1.07 bytes/sample
  64. 64. Downsampling BUT…
  65. 65. Downsampling Decompressing one sample takes 10-40 nanoseconds ● Times 1000 series @ 30s scrape interval ● Times 1 year
  66. 66. Downsampling Decompressing one sample takes 10-40 nanoseconds ● Times 1000 series @ 30s scrape interval ● Times 1 year ● Over 1 billion samples, i.e. 10-40s – for decoding alone ● Plus your actual computation over all those samples, e.g. rate()
  67. 67. Downsampling Block RAW Block @ 5m Block @ 1h 10x 12x
  68. 68. Downsampling raw chunk count sum min max counter raw chunk...
  69. 69. Downsampling count sum min max counter ...
  70. 70. Downsampling count sum min max counter count_over_time(requests_total[1h])
  71. 71. Downsampling count sum min max counter sum_over_time(requests_total[1h])
  72. 72. Downsampling count sum min max counter min(requests_total) min_over_time(requests_total[1h])
  73. 73. Downsampling count sum min max counter max(requests_total) max_over_time(requests_total[1h])
  74. 74. Downsampling count sum min max counter rate(requests_total[1h]) increase(requests_total[1h])
  75. 75. Downsampling count sum min max counter requests_total avg(requests_total) ... * avg
  76. 76. Full Architecture Querier SSD Sidecar Prometheus SSD Sidecar Prometheus QuerierQuerier … Compactor Store Bucket
  77. 77. Full Architecture $ thanos sidecar … $ thanos query … $ thanos store … $ thanos compact …
  78. 78. Deployment Models Querier S P QuerierQuerier … Store Bucket S P Querier S P QuerierQuerier … Store Bucket S P Querier S P QuerierQuerier … Store Bucket S P Cluster A Cluster B Cluster C
  79. 79. Deployment Models Querier S P QuerierQuerier … Store Bucket S P Querier S P QuerierQuerier … Store Bucket S P Querier S P QuerierQuerier … Store Bucket S P Cluster A Cluster B Cluster C Federation (through Store API)
  80. 80. Deployment Models Querier S P QuerierQuerier … Store Bucket S P S P … Store Bucket S P S P … Store Bucket S P Cluster A Cluster B Cluster C Global Scale Thanos Cluster
  81. 81. Cost ● Store + Query node ~ Savings on Prometheus side (+/- 0) ● Fewer SSD space on Prometheus side (savings) ● Basically: just your data stored in S3/GCS/HDFS + requests
  82. 82. Cost Example: ● 20 Prometheus servers each ingesting 100k samples/sec, 500GB of local disk ● 20 x 250GB of new data per month + ~20% overhead for downsampling ● $1440/month for storage after 1 year (72TB of queryable data) ● $100/month for sustained 100 query/sec against object storage Thanos Cost: $1540
  83. 83. Cost Example: ● 20 Prometheus servers each ingesting 100k samples/sec, 500GB of local disk ● 20 x 250GB of new data per month + ~20% overhead for downsampling ● $1440/month for storage after 1 year (72TB of queryable data) ● $100/month for sustained 100 query/sec against object storage ● $1530/month savings in local SSDs Thanos Cost: $1540 Prometheus Savings: $1530
  84. 84. Demo - retention
  85. 85. Demo - deduplication
  86. 86. Demo - deduplication
  87. 87. Any questions? github.com/improbable-eng/thanos Fabian Reinartz fabxc Bartek Plotka bwplotka Bplotka
  • maxmonlt

    Mar. 3, 2021
  • ssuser8036fe

    Feb. 25, 2021
  • wangxiaopeng65

    May. 10, 2020
  • ssuser6d57fa

    Feb. 10, 2020
  • NarendranathPanda

    Nov. 26, 2019
  • gnap

    Aug. 23, 2019
  • kerneljack

    Jun. 14, 2019
  • TaopaiC

    Jun. 14, 2019
  • jtrmnwu

    May. 31, 2019
  • ssusere5b495

    Apr. 11, 2019
  • pilgrim.in.rails

    Apr. 4, 2019
  • deniszh

    Jan. 2, 2019
  • kewin2010

    Dec. 27, 2018
  • jinsumoon33

    Dec. 27, 2018
  • zihado

    Nov. 7, 2018
  • tivvit

    Nov. 6, 2018
  • Jied1

    Nov. 1, 2018
  • ctyeh

    Sep. 3, 2018
  • dreampuf

    Aug. 16, 2018
  • ChungYiChen

    Aug. 15, 2018

Prometheus’s simple and reliable operational model is one of its major selling points. However, after surpassing a certain scale, we have identified a few shortcomings it imposes. We are proud to present Thanos, an open source project by Improbable that bundles a set of components that seamlessly transform existing Prometheus deployments, into a unified, global scale monitoring system. Authors: Fabian Reinartz, Bartlomiej Plotka Slides from January London Prometheus Meetup 2018. Thanos: https://github.com/improbable-eng/thanos

Views

Total views

23,763

On Slideshare

0

From embeds

0

Number of embeds

337

Actions

Downloads

318

Shares

0

Comments

0

Likes

35

×