Alex Nauda [Nobl9] | How Not to Build an SLO Platform | InfluxDays NA 2021

InfluxData
How Not to Build an SLO Platform
AGENDA
1. What is Nobl9 and why does it use InfluxDB?
2. Features of Nobl9 supported by InfluxDB
3. Lessons learned and challenges going forward
Alex Nauda
CTO, Nobl9
Twitter @alexnauda
Email alex@nobl9.com
nobl9.com / @nobl9inc / hello@nobl9.com
Nobl9 Architecture - Black Box View
[Diagram. Labels: Customer Platforms and Services (App, Web Services, Data, CI/CD, GitOps); data sources (New Relic, Prometheus, Datadog); Raw SLIs; SLO Config (YAML, GUI); Nobl9 (Web App, API, Calculations, InfluxDB); Error Budgets; SLO Based Events & Alerts; Graphs; Reports; Ops/SREs & Application Leads (Alert, Prioritize, Review & Align); PM & Business Stakeholders (Govern, Align, Review & Align).]
Why did we consider InfluxDB in the first place?
● Query-friendly Time Series Database: Flexible query capabilities to drive all our SLO charts and graphs
● Deployment options: Cloud offering for our SaaS platform; Enterprise for self-hosting customers; OSS for dev env… with good query compatibility across them all
● Commercial support: Firm requirement both for us (managing our SaaS offering) and our customers (self-hosting and managing Nobl9 including InfluxDB)
Nobl9 Feature Set
InfluxDB is useful across many of the core features of Nobl9
● Data Intake: Receiving telemetry data from a variety of sources
● Calculation of SLO time series: Processing of telemetry data (SLIs) and math
● Graphs and Reports: A variety of visualizations, real-time and historical
● Alerting on SLO time series: Notify various integrations based on configuration
● Data Export: Real-time and batch exports to other tools
Data Intake
Telemetry Requirements
Receiving telemetry data from a variety of sources
● Support a wide variety of data sources -- 15 and counting
○ Metrics systems
○ APM
○ RUM
○ Cloud platform built-in metrics
○ Log aggregation
○ Synthetics
○ Data warehouses
● Integrate via agent (self-hosted sidecar) as well as direct connection (SaaS-to-SaaS)
● Adapt to a wide variety of integration paradigms
○ API, Query, push or pull, various authentication mechanisms
● Be robust in the face of connectivity issues and operation across the internet and other networks
● Conform to various security models at large companies (for example, support web proxies BTF)
● Configuration-based telemetry
○ Integrate well with our SLOs-as-code paradigm
○ Customers apply changes using our web UI, CLI, Terraform provider, or Kubernetes operator
Coming Soon
Nobl9 Agent and Direct Connection Architecture
[Diagram. Labels: Customer's Environment (Prometheus Server, Nobl9 Agent); Public Internet; Nobl9 / AWS Cloud (AWS WAF, Nobl9 Intake Service, Direct); m2m Authentication.]
● Prometheus does not support authentication directly. Some users put Prom behind NGINX with HTTP Basic Auth or Client Cert Auth. N9 Agent doesn't support any authentication.
● Data source credentials are provided by the customer to the N9 Agent as environment variables. Credentials are not sent to Nobl9.
● N9 Agent executes queries against metric data sources on a defined interval using the environment credentials.
● N9 Agent polls the N9 Intake service to receive the latest configuration.
● N9 Agent pushes data to N9.
● N9 Intake service can only handle numeric float data types. (N9 Intake cannot receive or store PII.)
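The agent behavior described on this slide (query with local credentials, push only numeric values) can be sketched roughly as follows. This is a minimal illustration, not Nobl9's agent code: `collect_once`, the config layout, and the `run_query` and `push` callables are all hypothetical names introduced here.

```python
import os


def collect_once(config, run_query, push, env=os.environ):
    """One hypothetical agent cycle: query each source, push floats only.

    config, run_query and push are illustrative stand-ins, not Nobl9 APIs.
    """
    sent = []
    for source in config["sources"]:
        # Credentials are read from the local environment and are only
        # handed to the data source query, never added to the payload.
        credential = env.get(source["credential_env"], "")
        for timestamp, value in run_query(source["query"], credential):
            # Only numeric values are forwarded; strings (which could
            # carry PII) are dropped before anything leaves the agent.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            sent.append({"slo": source["slo"], "ts": timestamp, "value": float(value)})
    push(sent)
    return sent
```

Passing `run_query` and `push` in as callables keeps the sketch independent of any particular data source or intake API.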
Data Intake
Telemetry Architecture
Receiving telemetry data from a variety of sources
● Based on Telegraf
○ High quality data pipeline utility
○ Widely adopted, strong community
● Extended in-house to meet our specific requirements
○ Proprietary input and output plugins
○ Doesn't send directly to InfluxDB
○ Reports data to our Data Intake REST API
○ Dynamically reloads configuration after phoning home
● Direct connection (SaaS-to-SaaS) is special and a bit different
○ But still Telegraf is a component of it
Calculation of SLO time series
Calculation Requirements
Processing of telemetry data (SLIs) and math
● Calculate up-to-the-minute SLOs as data arrives
● Support a wide variety of SLO features
○ Rolling windows and Calendar-aligned windows
○ Ratio metrics as well as Threshold metrics
○ Occurrences-based calculation vs time slice-based calculation
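To make the last distinction concrete, here is a small sketch of the two calculation styles. The definitions used are assumptions based on common SLO practice rather than anything stated on the slide: occurrences-based counts good events over all events, while time slice-based counts the fraction of one-minute slices that individually met a per-slice target. The names `Minute`, `occurrences_sli`, `time_slice_sli`, and `slice_target` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    good: int   # events in this minute that met the SLI criterion
    total: int  # all events in this minute


def occurrences_sli(minutes):
    """Good events divided by all events across the whole window."""
    good = sum(m.good for m in minutes)
    total = sum(m.total for m in minutes)
    return good / total if total else 1.0


def time_slice_sli(minutes, slice_target=0.95):
    """Fraction of non-empty minutes whose own ratio met slice_target."""
    scored = [m for m in minutes if m.total > 0]
    if not scored:
        return 1.0
    good_slices = sum(1 for m in scored if m.good / m.total >= slice_target)
    return good_slices / len(scored)


window = [Minute(990, 1000), Minute(10, 10), Minute(5, 10), Minute(1000, 1000)]
occ = occurrences_sli(window)  # dominated by the high-traffic minutes
ts = time_slice_sli(window)    # the low-traffic bad minute counts as a full slice
```

With this data the quiet minute with 5 of 10 good events barely moves the occurrences-based figure but costs a quarter of the time slice-based one, which is why the two methods can disagree.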
Calculation Design
● Original version was built in InfluxDB
○ Huge prototyping win!
○ Used InfluxQL (but could have been done in Flux)
○ Queries were really intense
● Would have to scale vertically
○ Calculating SLOs repeatedly, on the fly, is intense
○ This would be a massive database
■ Calculations are memory intensive
■ Longer SLO time windows cost more
○ Add in requirements for HA, DR… a vertically scaled database solution is not ideal
When Your Architecture Requires a Vertically Scaled Database
Calculation of SLO time series
Calculation Architecture
Processing of telemetry data (SLIs) and math
● Rearchitected into custom code and Kafka
○ FIFO calculation approach
○ Maintains state, uses object storage as a backing store
○ Scales horizontally
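The idea of a stateful, first-in-first-out calculation can be illustrated with a rolling window that keeps running totals instead of re-querying the whole window for every new point. This is only a sketch under assumptions: the class name `RollingWindowSLO` is hypothetical, points are assumed to arrive in timestamp order, and the Kafka and object storage parts of the real system are not modeled.

```python
from collections import deque


class RollingWindowSLO:
    """Incremental ratio SLI over the window (t - window_seconds, t]."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.points = deque()  # (timestamp, good, total), oldest first
        self.good = 0
        self.total = 0

    def add(self, timestamp, good, total):
        # Append the newest point and update the running totals.
        self.points.append((timestamp, good, total))
        self.good += good
        self.total += total
        # Evict points that have fallen out of the window, oldest first.
        cutoff = timestamp - self.window_seconds
        while self.points and self.points[0][0] <= cutoff:
            _, old_good, old_total = self.points.popleft()
            self.good -= old_good
            self.total -= old_total
        return self.good / self.total if self.total else 1.0


slo = RollingWindowSLO(window_seconds=120)
first = slo.add(0, 9, 10)
second = slo.add(60, 10, 10)
third = slo.add(120, 10, 10)  # the point at t=0 leaves the window here
```

Each update touches only the new point and the evicted ones, so the cost per point does not grow with the window length, and one such object per SLO can be assigned to any worker.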
Graphs and Reports
Query Requirements
A variety of visualizations, real-time and historical
● Display up-to-the-minute data as values change (as new telemetry data arrives)
● Report over longer time scales as well -- over a year
● Allow users to seek through the data with a time window selector
● Support a multitude of SLOs running at once, and chart them
● Provide a wide variety of visualizations
○ SLO detail view
○ SLO grid view (list)
○ Various historical reports
○ Summaries such as Service Health Dashboard
Graphs and Reports
Query Architecture
A variety of visualizations, real-time and historical
● InfluxDB underlies all of this
○ All in Flux now
○ Flexibility is sufficient for a wide variety of creative uses
● Data granularity (resolution) is sometimes challenging in our use cases
○ We downsample data to hourly to display on longer time range graphs
○ We retain all the data in addition to the downsampled summary
○ Downsampling is done with InfluxDB Tasks
■ Requires some consideration for compatibility across InfluxDB codebases
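The slide says the downsampling itself runs as InfluxDB Tasks. The aggregation those tasks perform can be pictured with the following standalone sketch, which buckets raw points by hour. The function name `downsample_hourly` and the choice of the mean as the aggregate are assumptions made for illustration.

```python
from collections import defaultdict


def downsample_hourly(points):
    """points: iterable of (unix_seconds, value).

    Returns [(hour_start, mean_value)] sorted by hour. The raw points are
    left untouched, matching the 'retain all the data' note above.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)  # floor to the hour boundary
    return [(hour, sum(vals) / len(vals)) for hour, vals in sorted(buckets.items())]


raw = [(0, 1.0), (1800, 3.0), (3600, 5.0), (7199, 7.0)]
hourly = downsample_hourly(raw)
```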
Alerting on SLO time series
Alerting Requirements
Notify various integrations based on configuration
● Alert on configurable conditions based on SLO time series
○ Burn rate conditions
○ Error budget exhausted or partly exhausted
● Support a wide variety of alert methods and destinations
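The two alert conditions above can be sketched as follows. The formulas are the commonly used definitions, assumed here rather than taken from the slides: burn rate is the observed error rate divided by the error rate the target allows, and remaining budget is one minus the share of allowed bad events already consumed. All function names are illustrative.

```python
def burn_rate(bad, total, target):
    """Observed error rate relative to the error rate the target allows.

    1.0 means the budget is being spent exactly at the sustainable pace.
    """
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - target)


def budget_remaining(bad, total, target):
    """Fraction of the error budget left; zero or below means exhausted."""
    allowed = (1.0 - target) * total
    return 1.0 - bad / allowed if allowed else 0.0


def should_alert(bad, total, target, threshold):
    """Fire when the burn rate reaches the configured threshold."""
    return burn_rate(bad, total, target) >= threshold


fast_burn = burn_rate(bad=50, total=1000, target=0.99)       # about 5x
half_spent = budget_remaining(bad=5, total=1000, target=0.99)  # about 0.5
```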
Alerting on SLO time series
Alerting Architecture
Notify various integrations based on configuration
● Similar architecture to calculations
○ Custom Go code
○ Hanging off the same Kafka bus as calculations
● Requirements on the alert method integration side are the big driver
○ Integrate with APIs of integrated alert methods
○ Webhooks both tool-specific and a rich custom webhook
Data Export
Data Export Requirements
Real-time and batch exports to other tools
● Batch data export
○ Export delimited files to cloud object storage or fs
○ Import to popular data lake tooling
■ Snowflake
■ Big Query
● Real-time data export
○ Display SLO time series alongside other metrics
○ Incorporate SLO data in existing dashboards and visualizations
Data Export
Data Export Architecture
Real-time and batch exports to other tools
● Batch data export: custom
○ Manage authentication within and across clouds/hosting
○ Manage timing and performance of export jobs
○ Integrate with import to preferred query system
● Real-time requirements
○ Self-hosted customers
■ Use their dedicated InfluxDB
■ Use Chronograf
■ Wire InfluxDB up to something else if they want
○ SaaS customers
■ Real-time data feed in development now
■ Will support popular destinations (another InfluxDB)
Challenges Going Forward
Faceted SLOs
● Present a global SLO for a given SLI
● Provide the ability to drill down into that SLO along various dimensions
Higher Cardinality
● This data looks more like observability platform data
● Might consider a columnar database or other similar data stores
Possible InfluxDB option?
● We will be watching IOx closely to see if it could meet our needs here
[Diagram: SLI Data Faceting, High Cardinality. Facet dimensions: Network / ASN; Region / Data Center; AZ / subnet; Geographic Location; Individual User; B2C Customer; Feature / User Journey; Client Platform & Version.]
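A faceted SLO can be pictured as the same ratio computed once globally and once per value of a chosen dimension. The sketch below assumes per-event records carrying dimension keys; the function name `faceted_sli` and the record layout are hypothetical.

```python
from collections import defaultdict


def faceted_sli(events, facet=None):
    """events: dicts with a boolean 'good' plus arbitrary dimension keys.

    facet=None returns the global SLI under the key 'global'; otherwise
    one SLI per distinct value of the chosen dimension.
    """
    counts = defaultdict(lambda: [0, 0])  # facet value -> [good, total]
    for event in events:
        key = event[facet] if facet else "global"
        counts[key][0] += 1 if event["good"] else 0
        counts[key][1] += 1
    return {key: good / total for key, (good, total) in counts.items()}


events = [
    {"good": True, "region": "us"},
    {"good": True, "region": "us"},
    {"good": True, "region": "eu"},
    {"good": False, "region": "eu"},
]
overall = faceted_sli(events)
by_region = faceted_sli(events, facet="region")
```

Every distinct facet value becomes its own series, so a dimension such as individual user multiplies the series count, which is the cardinality concern raised on this slide.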
Thank You
Alex Nauda
CTO, Nobl9
Twitter @alexnauda
Email alex@nobl9.com