Cortex: Horizontally Scalable, Highly Available Prometheus

In this talk we present Cortex - a horizontally scalable, highly available Prometheus implementation. Like Prometheus, Cortex is a CNCF (sandbox) project.

Cortex turns many of Prometheus's architectural assumptions on their head by marrying a scale-out PromQL query engine with a storage layer based on NoSQL databases such as Bigtable, DynamoDB and Cassandra. We have disaggregated the Prometheus binary into a microservices-style architecture, with separate services for query, ingest, alerting and recording rules. Because all these services are designed as fungible replicas, the system can be scaled out with ease and the failure of any individual replica can be handled gracefully.

Published in: Technology

  1. 1. Cortex: Horizontally Scalable, Highly Available Prometheus Tom Wilkie, Nov 2018 @tom_wilkie
  2. 2. Prometheus • A monitoring & alerting system. • Inspired by Google’s BorgMon • Originally built by SoundCloud in 2012 • Open Source, now part of the CNCF • Simple text-based metrics format • Multidimensional datamodel • Rich, concise query language
  3. 3. Cortex • Horizontally scalable Prometheus • Distributed, fault tolerant architecture • Long term storage • Multitenant github.com/cortexproject/cortex
  4. 4. • 16/06/2016 First design doc • 25/08/2016 PromCon 2016 talk • 25/10/2016 Renamed to Cortex • 23/01/2017 Support for Recording Rules & Alerts • 13/07/2017 BigTable support added • 18/08/2017 PromCon 2017 talk • 08/02/2018 Cassandra support added • 20/09/2018 Join CNCF Sandbox http://goo.gl/prdUYV
  5. 5. • >2 million samples/s • >100 million timeseries [Logos: Adopters, Users]
  6. 6. Community • Commits from 37 contributors, spanning ~6 companies. • Apache 2 license. • Community mailing list + ~fortnightly call since Feb 2018. • Establishing governance based on CNI.

  7. 7. • Horizontally Scalable • Highly Available • Long Term Storage • Multitenant
  8. 8. Horizontally Scalable
  9. 9. Prometheus Scaling • Scale Up • Manually Shard [Diagrams: Prometheus scraping Your Jobs, Your Apps and Your Infra]
  10. 10. Cortex Scaling: Distributed Hash Table [Diagram: sample s → Cortex Distributor → hash(s) → four Cortex Ingesters at ring positions 0, 16, 32, 48]
  11. 11. Global View • Can configure multiple datasources in Grafana… • …but then only see data for one Prometheus at a time. [Diagram: Your Jobs and Your Apps in us-central1 and eu-west2]
  12. 12. Global View II • Can configure a “global” Prometheus to federate samples from “local” Prometheus… • …but in practice only propagate aggregates, have to preconfigure rules, hard to scale etc. [Diagram: Your Jobs and Your Apps in us-central1 and eu-west2, plus a “global” Prometheus]
  13. 13. Global View III • Or can push all data to a central Cortex cluster. • Cortex horizontal scalability allows it to scale to handle all the raw samples. [Diagram: Your Jobs and Your Apps in us-central1 and eu-west2, plus a “global” Cortex]
  14. 14. Highly Available
  15. 15. Prometheus HA [Diagram: Your Jobs and Your Apps, with two Alertmanagers]
  16. 16. Cortex HA: Dynamo-style replication • Distributor replicates samples on ingest. Waits for N/2 ACKs from ingesters to ensure consistency. • Querier de-dupes samples on read - again, only waiting for N/2 responses. [Diagram: sample s → Cortex Distributor → three Cortex Ingesters → Cortex Querier]
  17. 17. Long Term Storage
  18. 18. durability /dʒɔːrəˈbɪlɪti/ noun 1. the ability to withstand wear, pressure, or damage. “the reliability and durability of plastics”
  19. 19. Durability is hard… AWS DynamoDB Google Cloud Bigtable Apache Cassandra …let someone else deal with it.
  20. 20. • Why not just write the samples straight to the NoSQL DB? • By building & flushing chunks, Cortex acts as a “write deamplifier”, massively reducing cost. • The NoSQL DBs also don’t necessarily support the right indexes for executing PromQL queries. Cortex adds these. [Diagram: 30k samples/s, 450k series → ~10 IOPs]
  21. 21. Multitenant
  22. 22. Pod-per-tenant: Auth / Frontend … Automated Provisioning. Multitenant: Auth / Frontend. Natively multitenant services handle different users within the same process.
  23. 23. Pod-per-tenant: Pros • No application modifications necessary. • Effectively zero chance of “leakage” between tenants. Cons • Cattle-not-pets • Provisioning automation hides a lot of complexity… Multitenant: Pros • Per-tenant marginal costs can be close to zero • Can take advantage of statistical multiplexing. • Reduced provisioning complexity can be traded for more “interesting” architecture. Cons • Takes work…
  24. 24. • Horizontally Scalable • Highly Available • Long Term Storage • Multitenant
  25. 25. More Reading • PromCon 2016 talk • KubeCon 2016 talk • PromCon 2017 talk • Original design doc • CNCF TOC Presentation • Amazon’s Dynamo Paper
  26. 26. Get Involved! github.com/cortexproject/cortex #cortex on slack.cncf.io @tom_wilkie, tom.wilkie@gmail.com
  27. 27. What is Grafana Cloud? Grafana Cloud is a hosted and fully managed SaaS metrics platform that helps Ops and Dev teams using Grafana to understand the behavior of their applications and infrastructure. Grafana Cloud allows users to provision and manage the best open source observability tools - Grafana and Prometheus - all through a simple UI and single API. Grafana Cloud: Your complete, fully managed, hosted metrics platform. Store, visualize and alert without the headache of scaling or managing your own monitoring stack.
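
A note on slide 10: the distributed hash table can be pictured as a ring of tokens, where each sample's series is hashed to a position and handed to the ingester that owns the next token. The sketch below is only an illustration under stated assumptions, not Cortex's actual code: the ring size of 64, the ingester names, the use of SHA-256, and the helper names `series_hash` and `owner` are all hypothetical. Only the token positions 0, 16, 32 and 48 come from the slide.

```python
import bisect
import hashlib

# Hypothetical ring: four ingesters owning the token positions shown on slide 10.
RING_SIZE = 64
TOKENS = [0, 16, 32, 48]
INGESTERS = {0: "ingester-0", 16: "ingester-1", 32: "ingester-2", 48: "ingester-3"}


def series_hash(tenant: str, series: str) -> int:
    """Map a series to a ring position (SHA-256 is an arbitrary choice for this sketch)."""
    digest = hashlib.sha256(f"{tenant}/{series}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def owner(position: int) -> str:
    """Return the ingester whose token is the first at or after `position`, wrapping around."""
    idx = bisect.bisect_left(TOKENS, position) % len(TOKENS)
    return INGESTERS[TOKENS[idx]]


if __name__ == "__main__":
    pos = series_hash("tenant-a", 'up{job="api"}')
    print(pos, owner(pos))
```

Because ownership depends only on the hash and the token list, adding an ingester with a new token moves just the series that fall in its slice of the ring, which is what makes this layout attractive for scaling out.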

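A note on slide 16: the replicate-on-write, de-duplicate-on-read scheme can be sketched in a few lines. This is a toy model under stated assumptions, not Cortex's implementation: the `Ingester` class, the `write` and `read` helpers, and the in-memory sample lists are hypothetical, and the sketch rounds the slide's "N/2" up to a strict majority (`n // 2 + 1`), which is the usual Dynamo-style choice.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp_ms, value)


class Ingester:
    """Hypothetical in-memory replica; a real ingester would be a remote service."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy
        self.samples: List[Sample] = []

    def push(self, sample: Sample) -> bool:
        if not self.healthy:
            return False  # no ACK from a failed replica
        self.samples.append(sample)
        return True


def quorum(n: int) -> int:
    """Strict majority of n replicas (assumption: the slide's N/2, rounded up)."""
    return n // 2 + 1


def write(replicas: List[Ingester], sample: Sample) -> bool:
    """Distributor side: send to every replica, succeed once a quorum has ACKed."""
    acks = sum(1 for r in replicas if r.push(sample))
    return acks >= quorum(len(replicas))


def read(replicas: List[Ingester]) -> List[Sample]:
    """Querier side: merge responses from a quorum and drop duplicate timestamps."""
    responders = [r for r in replicas if r.healthy]
    if len(responders) < quorum(len(replicas)):
        raise RuntimeError("not enough replicas responded")
    merged: Dict[int, float] = {}
    for r in responders:
        for ts, value in r.samples:
            merged[ts] = value  # identical timestamps collapse to one sample
    return sorted(merged.items())
```

With three replicas and one of them down, both `write` and `read` still succeed, since two responses form a majority; with two down, both fail rather than return possibly incomplete data.
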
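
A note on slide 20: the "write deamplifier" idea is that samples are buffered per series and only whole chunks are written to the store. The sketch below is a minimal illustration, not Cortex's chunk format: `ChunkBuffer`, `samples_per_chunk` and `writes_per_second` are hypothetical names. The arithmetic helper shows one way the slide's figures could line up, assuming (this is not stated on the slide) that each series flushes roughly one chunk every 12 hours.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ChunkBuffer:
    """Buffers samples per series and counts one store write per flushed chunk."""

    def __init__(self, samples_per_chunk: int) -> None:
        self.samples_per_chunk = samples_per_chunk
        self.open_chunks: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
        self.store_writes = 0  # stands in for writes to the NoSQL database

    def append(self, series: str, ts: int, value: float) -> None:
        chunk = self.open_chunks[series]
        chunk.append((ts, value))
        if len(chunk) >= self.samples_per_chunk:
            self.flush(series)

    def flush(self, series: str) -> None:
        # A non-empty chunk costs exactly one write, however many samples it holds.
        if self.open_chunks.pop(series, None):
            self.store_writes += 1


def writes_per_second(series: int, chunk_span_seconds: float) -> float:
    """Steady-state write rate if every series flushes one chunk per span."""
    return series / chunk_span_seconds


if __name__ == "__main__":
    # 450k series, one chunk per series every 12 hours (assumed span).
    print(round(writes_per_second(450_000, 12 * 3600), 1))
```

Under that assumed 12-hour span, 450k series works out to about 10 chunk writes per second, compared with 30k writes per second if every sample were written individually.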