Grafana overview deck - Tech - 2023 May v1.pdf

Grafana Labs
Overview
Zoe Wang
Senior Enterprise AM
zoe.wang@grafana.com
+65 91066139

Grafana Labs’ big tent strategy

First Pane of Glass & Collaboration through “Big Tent”
The 10000-foot view
For who? C-level and executives
For what? Centralised view of your most critical KPIs (costs, incidents, etc) and global health of the
system
The Platform view
For who? Platform owners and Ops team
For what? Visualise individual SLOs/SLIs and detect which component of the platform
is impacting your SLA
The Service view
For who? SREs and DevOps
For what? Monitor key signals for a specific service, and start your
exploration/debugging/troubleshooting workflow.

Open source is at the heart of what we do
Employ 91% of the Loki team members, including project founders
Employ 89% of Grafana team members, including project founders
Employ 100% of the Tempo team members, including project founders
Employ 100% of the Mimir team members, including project founders
Employ 100% of k6 team members, including the project founders
Employ 44% of the Prometheus team members
The leading contributors to the Graphite project
Employ contributors, including a Governance Committee member
Employ 100% of OnCall team members, including the project founders
Employ 100% of Phlare team members, including the project founders
Employ 100% of Faro team members, including the project founders
1,000+ employees across 40+ countries
10M+ users across OSS and Cloud Free tier
1M+ instances across LGTM Cloud and OSS

What We Built : A Composable Observability Stack
OSS / Cloud / Enterprise
Observability-as-code: Open APIs + Webhooks + Terraform, Ansible
Security + Scale + Support
Visualize: Business, Application, and Infrastructure
Find: Telemetry (Metrics, Logs, Traces, Profiles)
Act: Incident Response & Management (OnCall, Incident, SLO, Alerting)
Prevent: Performance Testing (Load Testing, Browser Testing)
Enterprise Plugins, Community Plugins, +100s more

Simplified architecture across Mimir Loki Tempo

LGTM
Open
Big tent
Composable +

[Diagram: six numbered pipeline steps. Sources: Linux, MySQL, Apache, App. Steps: Exporters, Collection, Aggregation, Visualization, Alerts, Dashboards.]

Grafana Agent – Make your pipeline easy
Metrics - Based on Prometheus Agent with embedded exporters.
Logs - Embeds Promtail, the log forwarder built by Grafana, for Loki.
Traces - Based on OpenTelemetry Collector.
Universal | Feature-rich | Open

Grafana Enterprise / Grafana Cloud Features
Grafana Enterprise
Enterprise Plugins
Advanced Auth
Datasource Permissions
Query Caching
Role Based Access Control
Recorded Queries
Custom Training & Workshops
White Labelling
Unlimited Expert Support
Security
Auditing
Encryption
LDAP & Team Sync
Vault Integration
Reporting
Usage Insights
Machine Learning
On Call & Incident Management
Integrations
Ready to use dashboards & alerts

Grafana Enterprise Backends Features
Influx Ingestion & Query
Datadog Ingestion & Query
Graphite Ingestion & Query
Label Based Access Control
Cross Cluster Federation
Cardinality Management
Unlimited Expert Support
Authentication OIDC - JWT
Tenant Management
Grafana Metrics
Grafana Logs
Grafana Traces

SLA / SLO / SLI Management: USE (Utilization, Saturation, Errors), RED (Rate, Errors, Duration)
Full Stack Monitoring: Apps, services, infrastructure, databases, etc.
Business Scorecards: # users, # orders, # transactions, availability, # incidents, delivery velocity, revenue, etc.
Developer Productivity: Build time, delivery time, # deploys, % of changes that fail, # bugs, waiting time, code quality
Code quality monitoring: # incidents, # tests, performance test metrics, application metrics
On Call & Incident Management: # incidents, # alerts, escalation chains, incident management process, outlier detections
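
The RED signals named on this slide (Rate, Errors, Duration) can be computed from raw request records. The following is a minimal sketch, not a Grafana feature: the `Request` type, the `red_metrics` helper, the sample data, and the choice to count HTTP 5xx responses as errors are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code


def red_metrics(requests, window_s):
    """Return (rate per second, error ratio, mean duration) for one time window."""
    total = len(requests)
    if total == 0:
        return 0.0, 0.0, 0.0
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    rate = total / window_s
    error_ratio = errors / total
    mean_duration = sum(r.duration_s for r in requests) / total
    return rate, error_ratio, mean_duration


# Made-up sample: four requests observed in a two-second window.
sample = [
    Request(0.10, 200),
    Request(0.30, 200),
    Request(0.20, 500),
    Request(0.40, 200),
]
rate, error_ratio, mean_duration = red_metrics(sample, window_s=2)
print(rate, error_ratio, round(mean_duration, 3))
```

In practice these values would usually come from queries against a metrics backend rather than from in-memory records; the sketch only shows what each of the three signals measures.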

The Big Tent Philosophy - Single Pane of Glass
Go deeper with component-level dashboards or deep link into the data source

Getting Started – What’s your option?
Build your own stack: Self-hosted
Deploy the LGTM stack in your own data center or on cloud capacity. Scale, maintain, and upgrade it yourself.
Let experts handle it: Grafana Cloud
Automatic scaling, patches, and instant upgrades. Sign up and you are ready to go.
Exclusive features & opinionated solutions. Integrations, K8s app, ML, OnCall, Incident, …

The Reality of Doing It Yourself (aka. self-managed)
What you expect
● Enterprise Observability
● Single pane of glass for optimized operational performance
Hidden operational costs
● Provision servers
● Capacity planning
● Configure HA
● Configure Security
● Customize plugins/API scripting
● Ongoing Upgrades and Maintenance
● Constant maintenance takes you away from more important tasks
● Retaining tribal knowledge “how did we get here?”

What we are building with Grafana Cloud
Opinionated: Time-to-value with Grafana opinionated solutions
Completeness: Full stack advantage with Metrics, Logs, Traces and others
Cost Visibility: Cost insights and optimizations
*logs cardinality mgmt is under development
Incident Response & Management
ML & Synthetic
Performance Testing
>50 Integrations
*upcoming: SLO management, App O11Y

>50 observability integrations
✅Getting Started
✅Dashboards
✅Alerts

Application Observability Very unofficial wireframes
Public Preview
Frontend Observability
powered by

Grafana ML
Produce high-quality forecasts and adaptive alerts
✓ Prediction results are built from standard queries and results are exposed via PromQL.
✓ Users can be alerted when observed values differ from the prediction. Dynamic thresholds can be set based on prediction confidence.
Detect anomalies in real-time
✓ Identify abnormal patterns using the outlier detector.
✓ Alert your users instantaneously when outliers are detected.
Support various data sources
✓ Currently supported: Prometheus, Loki, Postgres, InfluxDB, BigQuery, Snowflake and Datadog.
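
To make the idea of a dynamic threshold concrete, here is a small sketch of alerting when an observed value falls outside a band around a prediction. It is not the model Grafana ML uses: the "forecast" is just the mean of recent history, the band is a fixed number of standard deviations, and the names `forecast_band` and `is_anomalous` and the sample values are hypothetical.

```python
from statistics import mean, stdev


def forecast_band(history, width=3.0):
    """Naive prediction: the mean of history, with a band of +/- width standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(observed, history, width=3.0):
    """True when the observed value falls outside the predicted band."""
    low, high = forecast_band(history, width)
    return observed < low or observed > high


# Made-up series with mean 100 and sample standard deviation 2.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(forecast_band(history))      # band is 94 to 106 for this data
print(is_anomalous(101, history))  # inside the band
print(is_anomalous(140, history))  # outside the band
```

A narrower `width` makes the alert more sensitive; a real forecasting model would also account for trend and seasonality, which this sketch ignores.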

Correlation by design
Exemplars
Labels & service discovery
CPU{service="foo", region="eu-west", node="node-123", cluster="bar"}=X
Auto-generated metrics
Logs for trace
TraceID in logs
Logs to metrics extraction
2023-01-01 {service="foo",cluster="bar"} LOG CONTENT
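
The correlation idea on this slide rests on two things: metrics and logs that carry the same labels, and a trace ID written into log lines. The sketch below illustrates both under stated assumptions. The helper names are hypothetical, the `traceID=` key and the sample log line are invented for the example, and the label parser handles only the simple `key="value"` form shown on the slide.

```python
import re


def parse_labels(selector):
    """Parse a label set such as '{k="v", k2="v2"}' into a dict."""
    return dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', selector))


def shares_labels(stream, series):
    """True when every label on the log stream has the same value on the metric series."""
    return all(series.get(key) == value for key, value in stream.items())


# Assumption: log lines carry the trace ID as 'traceID=<hex>'.
TRACE_ID = re.compile(r"traceID=([0-9a-f]+)")


def trace_id(line):
    """Extract the trace ID from a log line, or None if there is none."""
    match = TRACE_ID.search(line)
    return match.group(1) if match else None


metric_labels = parse_labels(
    '{service="foo", region="eu-west", node="node-123", cluster="bar"}'
)
log_labels = parse_labels('{service="foo",cluster="bar"}')

print(shares_labels(log_labels, metric_labels))
print(trace_id("level=error traceID=4bf92f3577b34da6 msg=timeout"))
```

Because the log stream's labels are a subset of the metric's labels, a dashboard can jump from the metric to the matching logs, and from a log line to its trace, without any extra mapping table.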

Unify and Correlate Your Data
[Diagram labels]
Sources: On premise, Public Cloud, Microservices, Infrastructure, Applications, Event Brokers
Data Ingestion with Agents & Clients
Telemetry: Metrics, Logs, Traces
Capabilities: API, Alerts, Synthetics, Reports, Correlate, Dashboards, Plugins, …, Pre-built Dashboards & Alerts, Usage Insights, +100, On Call Teams & Escalation Chains Management, Incident Management, Forecast, Outlier Detection
Goals: Reducing Current Spend, Improving Reliability

PromQL LogQL
Lookup &
search
(TraceQL*)
Promtail
Unified platform and format agnostic

µservices architecture
[Diagram: Mimir components: distributor, ingester, query-frontend, querier, store-gateway, compactor. Labels: Reads, Writes, Object storage, SSD storage, Specific service (optional).]
Highly scalable.
Lower TCO at scale.
No downsampling required.

Run at any scale
Grafana Mimir can scale virtually without limit
We have scaled it to 1 billion active series with a 20-second scrape interval
Learn more: https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/
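
As a rough worked example of what those two figures imply together: if every active series is scraped once per interval, the ingest rate is the series count divided by the interval. The variable names are illustrative, and the calculation assumes a uniform scrape interval across all series.

```python
active_series = 1_000_000_000  # 1 billion active series, from the slide
scrape_interval_s = 20         # 20-second scrape interval, from the slide

# Assumption: each series yields exactly one sample per scrape interval.
samples_per_second = active_series / scrape_interval_s
samples_per_day = samples_per_second * 86_400

print(f"{samples_per_second:,.0f} samples/s")
print(f"{samples_per_day:,.0f} samples/day")
```

That works out to 50 million samples per second under the stated assumption.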

How Loki stores & indexes logs
2019-12-11T10:01:02.123456789Z {app="nginx",cluster="us-west1"} GET /about
Timestamp (with nanosecond precision): indexed
Prometheus-style Labels (key-value pairs): indexed
Content (log line): unindexed

Efficient logging
Log Data: 10TB
Index: 200MB
Think of it more like a table of contents than an index
Loki does not index the text of logs. Instead, entries are grouped into streams and indexed with Prometheus-style labels.
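
The grouping of entries into label-indexed streams can be sketched in a few lines. This is an illustration of the idea only, not Loki's storage format: the regular expressions, the `parse` and `build_streams` helpers, and the second and third sample lines are assumptions made for the example (the first line is the one shown in the deck).

```python
import re
from collections import defaultdict

# Assumed line layout: '<timestamp> {<labels>} <content>'.
LINE = re.compile(r"^(\S+)\s+\{([^}]*)\}\s+(.*)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(line):
    """Split a line into timestamp, label set, and unindexed content."""
    timestamp, labels, content = LINE.match(line).groups()
    return timestamp, frozenset(LABEL.findall(labels)), content


def build_streams(lines):
    """Group entries by label set; only the label sets act as the index."""
    streams = defaultdict(list)
    for line in lines:
        timestamp, labels, content = parse(line)
        streams[labels].append((timestamp, content))
    return streams


lines = [
    '2019-12-11T10:01:02.123456789Z {app="nginx",cluster="us-west1"} GET /about',
    '2019-12-11T10:01:03.000000000Z {app="nginx",cluster="us-west1"} GET /contact',
    '2019-12-11T10:01:04.000000000Z {app="api",cluster="us-west1"} POST /orders',
]
streams = build_streams(lines)
nginx = frozenset({("app", "nginx"), ("cluster", "us-west1")})

print(len(streams))         # number of distinct label sets
print(len(streams[nginx]))  # entries in the nginx stream
```

The index here holds one key per distinct label set regardless of how many lines each stream contains, which is why it stays small relative to the log data.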

Fast queries
Raw Logs: 1PB
Timeframe: 80TB
Label selector: 1TB
Brute force search - heavily parallelized: 120GB+/s
{ cluster="us-central1", job=~"nginx*" } |= "needle in the haystack"
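
The narrowing steps on this slide (time range, then label selector, then a brute-force text filter) can be sketched as a three-stage filter. This is a simplified in-memory model, not how Loki executes queries. The `query` helper and the sample entries are hypothetical, and the job pattern is written as `nginx.*` on the assumption that a prefix match was intended. The last lines estimate the scan time for the slide's figures, assuming 1 TB = 1000 GB.

```python
import re

# Made-up entries: (timestamp in seconds, labels, log line).
entries = [
    (100, {"cluster": "us-central1", "job": "nginx"}, "GET /about"),
    (105, {"cluster": "us-central1", "job": "nginx-ingress"}, "needle in the haystack"),
    (110, {"cluster": "us-central1", "job": "mysql"}, "needle in the haystack"),
    (999, {"cluster": "us-central1", "job": "nginx"}, "needle in the haystack"),
]


def query(entries, start, end, equals, regex, needle):
    # Stage 1: keep only entries inside the time range.
    in_range = [e for e in entries if start <= e[0] < end]
    # Stage 2: apply label matchers (exact match and fully anchored regex).
    selected = [
        e for e in in_range
        if all(e[1].get(k) == v for k, v in equals.items())
        and all(re.fullmatch(p, e[1].get(k, "")) for k, p in regex.items())
    ]
    # Stage 3: brute-force substring search over what is left.
    return [e[2] for e in selected if needle in e[2]]


hits = query(
    entries, 0, 200,
    equals={"cluster": "us-central1"},
    regex={"job": "nginx.*"},
    needle="needle in the haystack",
)
print(hits)

# Scan time for 1 TB at 120 GB/s, using the figures from the slide.
scan_seconds = 1000 / 120
print(round(scan_seconds, 1), "seconds")
```

Only the final stage reads log content, so the earlier stages determine how much data the brute-force search has to scan.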

The better tradeoff
Grafana Loki: Query time processing
● Log any and all formats
● Smaller indexes
● Cost effective resource usage
● Fast enough queries for SRE
● Cut and slice your logs in dynamic ways - ask new questions
VS
Content indexing: Upfront / ingest time processing
● Decide on log formats aka “common schema”
● Large indexes
● More expensive to run
● Faster queries
● Restricted to format chosen at ingestion time

Incident response process
Grafana Alerting (WHAT)
- Configure alerts on metrics
Grafana OnCall (WHO)
- Schedules
- Escalation policies
- Notifications to wake you up
Grafana Incident (WHY)
- Declare incident
- Assign roles
- Manage tasks
- Put the fire out

Grafana Incident Response is unified and integrated
Grafana Alerting Grafana OnCall Grafana Incident

Grafana OnCall is a new on-call management tool that’s available in Grafana Cloud.

About Grafana k6
● k6 is a load and performance testing tool including k6 OSS and k6 Cloud
● Leader in modern load testing of APIs, microservices, and websites
● Joined the Grafana Labs family in June 2021
● Shift testing and observability left
[Diagram: Software Development Life Cycle. Pre-prod: Pre-production (proactive), Virtual User traffic. Prod: Production (reactive), Real User traffic.]

What use cases does k6 cover?
Primary Use Cases
Load Testing
k6 is optimized for minimal resource consumption and designed for running high load tests (spike, stress, soak tests).
k6 is used to test the performance and reliability of APIs, microservices, and websites.
Specifically this means you can:
- Load test your backends - APIs and microservices.
- Get a complete view of your website user experience. With xk6-browser (beta), users can mix backend load testing and frontend browser testing in the same script for end-to-end website testing.
Secondary Use Cases
Chaos Testing/Failure Injection Testing
k6 can be used as part of chaos experiments. Chaos engineering is mostly done in a pre-production environment, or in production during quiet traffic, so you need a load testing tool to simulate real traffic during chaos experiments.
Synthetic Monitoring
With k6, you can run tests with a small amount of load to continuously validate the performance and availability of your production environment.

Modern teams are shifting performance testing left
Testing frequency
OLD WAY: Before releases
NEW WAY: Continuously: nightly, in feature branches, when infra changes, before releases, in prod
Release frequency
OLD WAY: Quarterly or biannually
NEW WAY: Weekly
What to test
OLD WAY: User stories, high-risk components
NEW WAY: User stories, high-risk components, services, infrastructure, unexpected failures
How is performance testing done
OLD WAY: Manually
NEW WAY: Automatically as part of CI/CD
Who is responsible for performance testing
OLD WAY: QA
NEW WAY: Developers, QA/SDET, SRE