2. Who are we?
Michael Goodness (@opsgoodness)
● Systems Architect & Tech Lead, Kubernetes/Cloud Native
● 105 weeks @ Ticketmaster
● 3 years prod experience w/ K8s
● 2.5 years prod experience w/ Prometheus
Abraham Ingersoll (@aberoham)
● Solutions Engineering @ Gravitational
● K8s-as-a-Platform + ZeroTrust SSH/kubectl
● 145 weeks @ Ticketmaster, ~70 weeks (and counting) @ an enterprise software vendor
● Relegated to talking to teams about their production environments rather than directly playing with them
3. The “Early” Evolution (Late ’90s to Early 2000s)
Late ’90s: MRTG → RRDTool
2002: Nagios, Ganglia, Cacti
2003-2005: Nagios Plugins, Remote Execution, Database Backends
2005+: Commercial Nagios Clones, Splunk
4. - Mid-’90s Perl; used ASCII files to monitor university internet links
- Leveraged management protocols to automatically discover counters & gauges
- Used to prove to management that larger links were needed
- Released as source to “give back”
5. Notable
Fixed Sizing & Archival
Clean C library, GPL License
Predictable Performance
Still v1.x Today!
“Basically MRTG done right” - Tobi, July 1999
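The “Fixed Sizing & Archival” point can be illustrated with a toy round-robin archive. This is a minimal Python sketch of the general idea only, not RRDTool's file format or API; the names `RoundRobinArchive` and `consolidate` are hypothetical. Storage is allocated once, new samples overwrite the oldest, and older data can be averaged into coarser points.

```python
class RoundRobinArchive:
    """Fixed-size archive: once full, each new sample overwrites the oldest."""

    def __init__(self, slots):
        if slots <= 0:
            raise ValueError("slots must be positive")
        self.slots = slots
        self.values = [None] * slots  # allocated once, never grows
        self.count = 0                # total samples ever written

    def update(self, value):
        # The write position wraps around, so disk/memory use stays constant.
        self.values[self.count % self.slots] = value
        self.count += 1

    def fetch(self):
        """Return the stored samples, oldest first."""
        if self.count <= self.slots:
            return self.values[:self.count]
        start = self.count % self.slots
        return self.values[start:] + self.values[:start]


def consolidate(samples, step):
    """Average each full group of `step` samples into one coarser point."""
    return [sum(samples[i:i + step]) / step
            for i in range(0, len(samples) - step + 1, step)]
```

Because the archive never grows, write cost and storage are known up front, which is one way to read the “Predictable Performance” bullet.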
7. - Network Admin @ University of Minnesota “scratching an itch”
- Grandfather and inspiration of many others
- Single maintainer for a very long time
Key Wins
Simple “Core”
Plugin System
Open Source, Extensible
10. - Top-down procurement (7+ figures!) to replace Nagios
- Windows-based, with a neat message bus and a kitchen sink of features
- Purchase included a promise to add Linux support/integration
Why, Oh Why?
Single Pane of Glass
Windows GUI
Vendor Support
12. The “Web 2.0” Era (2008+, ~2010 to ~2016)
2008+: Graphite, StatsD; RRDTool → Whisper
April 2010: OpenTSDB
2012: Splunk IPO
2015: Prometheus Announced
13. - Great graphing features that allowed for rapid prototyping & discovery
- Replaced RRDTool with a simpler time-series DB that allowed “irregular updates” and historical data import
Key Wins
Fantastic Graphing
Less-Rigid Data Storage Model
“Percentiles”
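To make “irregular updates” and historical import concrete, here is a small Python sketch of the two wire formats involved. It assumes Graphite's plaintext line format (`path value timestamp`) and the common StatsD datagram format (`name:value|type`). The helper names and metric names are hypothetical, and nothing is actually sent over the network.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample for Graphite's plaintext listener.

    Each line carries its own timestamp, so samples may arrive at irregular
    intervals, and old data can be imported by passing a past timestamp.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def statsd_line(name, value, kind):
    """Format one StatsD datagram: counter ("c"), gauge ("g"), or timer ("ms")."""
    if kind not in ("c", "g", "ms"):
        raise ValueError(f"unsupported StatsD type: {kind!r}")
    return f"{name}:{value}|{kind}"


# Hypothetical metric names, for illustration only.
backfill = graphite_line("tm.web.requests", 42, 1262304000)
timer = statsd_line("tm.web.latency", 320, "ms")
```

Timers are the StatsD type from which percentile-style summaries are typically derived before being flushed to Graphite.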
16. - The Cadillac of “Big Data Analytics Tools”
- Simple query syntax and delicious graphing user interface
- Often imitated (SumoLogic, LogDNA, etc.)
Key Wins
Full-Text Search
“Excel”-like UX
“Democratize” Access to Logs
18. - Required dedicated admins to keep “junk” logs out
- Full-text search became a crutch, logs used instead of lighter-weight native instrumentation
- Hard to justify the ever-growing costs when provided as a central service
TM Challenges
Policing Usage
PII Data Escapes
Cost Shifting
20. - Leverages Hadoop HBase column-oriented data store to be horizontally scalable
- Simple metrics listening daemon and wire protocol (“time series daemon”)
- Minimalist GUI that simply spits out GNUPlot charts
Key Wins
Clustered
Non-Compacting
“Big Data”
22. - Over-zealous instrumentation in hot code paths triggered show-stopping outages
- TSDs queueing metrics and attempting to reconnect all at once killed a datacenter-wide firewall
- Large queries of historical data could block writes and thrash the hot cache of recent metrics
- Teams always wanted to throw high-cardinality data at it (“page speed metrics per show”)
TM Challenges
HBase Management
Thundering Herds
Limited row key “namespace”
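The “Thundering Herds” problem (many clients reconnecting at the same instant) is commonly softened with randomized exponential backoff. The Python sketch below shows that general technique under stated assumptions; it is not something the slides say was deployed, and the function name and default values are hypothetical.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=random):
    """Yield one randomized delay (in seconds) per reconnect attempt.

    The upper bound doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, bound] ("full jitter"), so clients
    that failed at the same moment spread their retries out over time.
    """
    for attempt in range(attempts):
        bound = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, bound)


# Usage: a client would sleep for each delay before its next reconnect try.
delays = list(backoff_delays(5, base=0.5, cap=8.0))
```

Without the random component, every client would compute the same schedule and still arrive together, only later.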
25. Prometheus
● Started at SoundCloud by former Google SREs
● Based on Borgmon
● Pull model
● Dimensional data
● PromQL query language
● Local & remote storage
● Dozens of exporters & integrations
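“Pull model” and “dimensional data” meet in the text a target exposes for scraping: each sample is a metric name plus a set of key/value labels. The Python sketch below renders that text format by hand, assuming the Prometheus text exposition format (`# HELP` and `# TYPE` lines followed by `name{label="value"} value`). The helper names and the sample metric are hypothetical; real services would normally use an official client library instead.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line; labels are the 'dimensions' of the series."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_metric(name, help_text, kind, samples):
    """Render a metric family: HELP and TYPE lines, then one line per series."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


# Hypothetical metric: one series per (method, code) combination.
page = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027),
     ({"method": "POST", "code": "500"}, 3)],
)
```

Every distinct label combination is its own time series, which is why unbounded label values (the high-cardinality data mentioned earlier) need care here too.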
26. What’s next?
In a word: observability
● Moar better metrics
● Distributed tracing (OpenTracing, Jaeger)
● Better logs