Monitoring at
Spotify
- When things go ping in the night
Martin Parm, Product owner for Monitoring
‣ Martin Parm, Danish, 36 years old
‣ Master’s degree in CS from Copenhagen
University
‣ IT operations and infrastructure since 2004
‣ Joined Spotify in 2012 (and moved to Sweden)
‣ Joined Spotify’s monitoring team in February
2014
‣ Currently Product Owner for monitoring
About Martin Parm
This talk neither endorses nor condemns any
specific products, open source or otherwise. All
opinions expressed are in the context of Spotify’s
specific history with monitoring.
Disclaimer
This talk is not a sales pitch for our monitoring
solution. It’s a story about how it came to be.
Disclaimer (2)
Spotify -
what we do
‣ Music streaming - and discovery
‣ Clients for all major operating systems
‣ Partner integration into speakers, TVs,
PlayStation, Chromecast, etc.
‣ More than 20 million paid subscribers*
‣ More than 75 million daily active users*
‣ Paid back more than $3 billion in royalties*
The service - the right music for every
moment
* https://news.spotify.com/se/2015/06/10/20-million-reasons-to-say-thanks/
‣ Major tech offices in Stockholm, Gothenburg,
New York and San Francisco
‣ Small sales and label relations offices in all
markets (read: countries)
‣ ~1500 employees worldwide
○ ~50% in Technology
○ ~100 people in IO, our infrastructure department
○ 6 engineers in our monitoring team
The people
‣ 4 physical data centers
‣ ~10K physical servers
‣ Microservice infrastructure with ~1000 different
services
‣ Mostly Ubuntu Linux
The technology
‣ Ops-In-Squads
○ Distributed operational responsibility in the feature teams
○ Monitoring as self-service for backend
‣ Two monitoring systems for backend
○ Heroic - time series based graphs and alerts
○ Riemann - event based alerting
‣ ~100M time series
‣ ~750 graph-based alert definitions
‣ ~1500 graph dashboards
Operations and monitoring
Our story
begins back in
2006...
Spotify starts in 2006 - a different world
‣ No cloud computing
○ AWS had only just launched
○ Google App Engine and Microsoft Azure didn’t exist yet
‣ Few cloud applications
○ Gmail was still in limited public beta
○ Google Docs was still in limited testing
○ Facebook opened for public access in September
‣ No smartphones
○ Apple had not even unveiled the iPhone yet
○ Android didn’t become available until 2008
Minutes from meeting in June 2007
“Munin mail -
It's not sustainable to get
100 status mails per day
about full disks to dev. X
should turn them off.”
‣ Sitemon; our first graphing system
○ Based on Munin but with a custom frontend
○ Metrics were pulled from hosts by aggregators
○ Metrics were written to several different RRD files with different
resolutions
○ Static graphs were generated from these RRD files
○ One single dashboard for all systems and users
○ Main metrics: Backend Request Failures
First steps with monitoring
-- Emil Fredriksson, Operations director at Spotify
(Slightly paraphrased)
“Our first alerting system was an
engineer, who looked at graphs all
the time; day and night; weekends.
And he would start calling up
people, when something looked
wrong.”
‣ Spotify launched in Sweden and UK in October
2008
‣ Zabbix was introduced in September 2009
‣ Alerts were sent as text messages to
Operations, who would then contact feature
developers
‣ Most common “alerting source”: Users
○ Operations had a permanent Twitter search
First steps with alerting
2011/2012:
Ops in squads
‣ Opened our 3rd data center and grew to ~1000
hosts
‣ Spotify grew from ~100 to 400-600 people
worldwide in a few months
‣ Many new engineers didn’t have operational
experience or DevOps mentality
‣ A rift between dev and ops emerged...
2011-2012: The 2nd great hiring spree
‣ Development speed-up and a vast increase in
new services
‣ Stability and reliability were an increasing
problem
‣ Service ownership was often unclear
‣ Too frequent changes for a monolithic SRE
team to keep up
2011-2012: The 2nd great hiring spree
‣ Big incidents almost every week → The
business were unhappy
‣ Constant panic and fire fighting → The SRE
team were unhappy
‣ Policies and restrictions, and angry SRE
engineers → The feature developers were
unhappy
2011-2012: The 2nd great hiring spree
“The infrastructure and feature squads that write
services should also take responsibility for correct
day-to-day operation of individual services.”
‣ Capacity Planning
‣ Service Configuration and Deployment
‣ Monitoring and Alerting
‣ Defining and Managing to SLAs
‣ Managing Incidents
September 2012: Ops In Squads
Benefits
‣ Organizational Scalability
‣ Faster incident solving - getting The Right
Person™ on the problem faster
‣ Accountability - making the right people hurt
‣ Autonomy - feature teams make all their own
planning and decisions
September 2012: Ops In Squads
Human challenges
‣ Developers need training, but not a new
education
‣ Developers need autonomy, but will do stupid
things
‣ Developers need to care about monitoring and
alerting, but not the monitoring pipeline
September 2012: Ops In Squads
We needed infrastructure as services
‣ The classic Operations team was disbanded
‣ Operations engineers and tools teams were
reformed in IO, our infrastructure organization
‣ Teaching and self-service became a primary
priority
September 2012: Ops In Squads
“Creating IO was
probably one of the
smartest moves in
Spotify”
-- Previous Product Owner for Monitoring
(Slightly paraphrased)
Three tales of
failure*
* Read: Learning opportunities
‣ Late 2011: Backend Infrastructure Team (BIT)
was formed
○ BIT was the first infrastructure team at Spotify
‣ Tasked with log delivery, monitoring and
alerting
‣ Development of Sitemon2 began
○ Meant to replace Sitemon
○ Still based on Munin, but with a Cassandra backend and much
more powerful frontend
Sitemon2 - The graphing system, which
never launched
...but BIT was set up for failure from the start
‣ Sitemon2 was developed mostly in isolation and
with very little collaboration with developers
‣ Priority collisions: Log delivery was always
more critical than monitoring
‣ Scope creep: BIT tried to integrate Sitemon2
with analytics
Sitemon2 - The graphing system, which
never launched
We needed feature teams to take part in
monitoring, but Zabbix was too inflexible and hard
to learn.
‣ Late 2012: Development of OMG began
○ Event streaming processor similar to Riemann
○ Initial development was super fast and focused
○ Developed in collaboration with Operations
‣ A few teams adopted OMG, but.....
OMG - The alerting system no one
understood
OMG rule written in Esper
{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].avg(300)}>0.1
&(({TRIGGER.VALUE}=0
&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}
/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.9)
|({TRIGGER.VALUE}=1
&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}
/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.97))
The alerting rule language was Esper (EPL)
‣ Most engineers found the learning curve way
too steep and confusing
‣ Too few tools and libraries for the language
Why was this not caught?
‣ The Ops engineer assigned to the
collaboration happened to also like Esper
OMG - The alerting system no one
understood
‣ February 2013: One of our system architects
builds Monster as a proof-of-concept hack
project
○ In-memory time series database in Java
○ Based on Munin collection and data model
○ Metric data was pushed rather than pulled
○ The prototype was completed in 2 weeks
Monster
‣ Pushing monitoring data was much more
reliable than pulling
‣ Querying and graphing were blazing fast
‣ The Operations engineers loved it!
‣ Sitemon kept running, but development of
Sitemon2 was halted
We’ll get back to the failure part...
Monster
2013: The birth
of a dedicated
monitoring
team
‣ First dedicated monitoring team at Spotify
‣ Tasked with “Providing self-service
monitoring solutions for DevOps teams”
‣ Inherited Monster, Zabbix and OMG
‣ Calculation: Monster could survive a year, so
we focused on alerting first
Hero Squad
‣ Replaced Zabbix and OMG with Riemann
○ “Riemann is an event stream processor”
○ Written in Clojure
○ Rules are also written in Clojure
‣ We built a support library with helper functions,
namespace support and unit testing
‣ Built a web frontend for reading the current
state of Riemann
Riemann as a self-service alerting
system
* Some boilerplate code *
(def-rules
  (where (tagged-any roles)
    (tagged "monitoring-hooks"
      (:alert target))
    (where (service "vfs/disk_used/")
      (spotify/trigger-on (p/above 0.9)
        (with {:description "Disk is getting full"}
          (:alert target))))))
Riemann rule written in Clojure
‣ Success: Riemann was widely adopted
‣ Success: Riemann is a true self-service
○ Riemann rules live in a shared git repo, which gets
automatically deployed
○ Each team/project has its own namespace
○ Unit tests ensure that rules work as intended
○ Peak: 36 namespaces and ~5000 lines of Clojure code
‣ Failure: Many engineers didn’t understand or
like the Clojure language
Riemann as a self-service alerting
system
2014: Now for
the pretty
graphs...
‣ Sharding and rebalancing quickly became a
serious operational overhead
○ The whisper write pattern involved randomly seeking and
writing across a lot of files – one for each series
○ The Cyanite backend has recently addressed this
‣ Hierarchical naming
○ Example: “db1.cpu.idle”
○ Difficult to slice and dice metrics across dimensions, e.g. selecting
all hosts in a site or all hosts running a particular piece of software
A brief encounter with collectd and
Graphite
‣ Replace long
hierarchical names
with tags
‣ Compatible with
what Riemann does
with events
‣ Makes slicing and
dicing metrics easy
.... but who supports it?
Metrics 2.0
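To make the contrast with hierarchical names concrete, here is a minimal sketch of tag-based selection. The tag names (`host`, `site`, `role`, `what`) and the sample series are made up for illustration; they are not taken from Spotify's setup or from any particular metrics specification.

```python
# Hypothetical series: each one is identified by a set of tags rather than
# by a dotted name such as "db1.cpu.idle".
SERIES = [
    {"host": "db1", "site": "sto", "role": "database", "what": "cpu-idle"},
    {"host": "db2", "site": "lon", "role": "database", "what": "cpu-idle"},
    {"host": "ap1", "site": "sto", "role": "accesspoint", "what": "cpu-idle"},
    {"host": "ap1", "site": "sto", "role": "accesspoint", "what": "disk-used"},
]


def select(series, **wanted):
    """Return every series whose tags match all of the wanted key/value pairs."""
    return [s for s in series if all(s.get(k) == v for k, v in wanted.items())]


# "All hosts in a site" and "all hosts with a given role" become plain filters.
sto_hosts = sorted({s["host"] for s in select(SERIES, site="sto")})
db_cpu = select(SERIES, role="database", what="cpu-idle")
print(sto_hosts)
print(len(db_cpu))
```

With dotted names, the same two questions would need the site and role to be encoded at a fixed position in every name, or a lookup outside the metric system.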
‣ We needed quick adoption and commitment from
our feature teams
‣ Monitoring was still very immature and we needed
room to experiment and fail
‣ Problem: Engineers get sick of migrations and
refactorings
‣ Solution: A flexible infrastructure and an “API”
Adoption vs. flexibility
‣ Small daemon running on each host, which
forwards events and metrics to monitoring
infrastructure
‣ First written in Ruby, but later ported to Java
‣ Provides a stable entry point for our users
ffwd - a monitoring “API”
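As a rough illustration of what a stable local entry point can look like from the application side, the sketch below pushes one metric as a JSON datagram to an agent on the same host. The port number, the JSON field names and the helper names are assumptions made for this example; they are not presented as ffwd's actual protocol.

```python
import json
import socket
import time


def encode_metric(key, value, tags, now=None):
    """Serialize one metric as a JSON datagram (field names are illustrative)."""
    timestamp = time.time() if now is None else now
    payload = {
        "type": "metric",
        "key": key,
        "value": value,
        "tags": tags,
        "time": int(timestamp * 1000),  # milliseconds since the epoch
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def send_metric(address, key, value, tags, now=None):
    """Fire-and-forget: push one datagram to the local agent, return bytes sent."""
    data = encode_metric(key, value, tags, now)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(data, address)


if __name__ == "__main__":
    AGENT = ("127.0.0.1", 19000)  # hypothetical address of the local agent
    send_metric(AGENT, "disk-used", 0.42, {"host": "db1", "site": "sto"})
```

The application only needs to know the local address. Where the data goes afterwards can change without touching the application, which is the point of the abstraction.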
Abstracting away the infrastructure from
monitoring collection
Magical
monitoring
pipeline
Metrics and events
ffwd
Alerts
Pretty
graphs
‣ Atlas, developed by Netflix
○ Hadn’t been open sourced yet
‣ Prometheus, developed by SoundCloud
○ Hadn’t been open sourced yet
‣ OpenTSDB, originally developed by
StumbleUpon
○ Was rejected because of bad experiences with HBase
‣ InfluxDB
○ Was too immature at the time
Tripping towards graphing; it’s all about
the timing
‣ Time series database written in Java and
backed by Cassandra
‣ Looked promising at first
○ We deployed it and killed Sitemon for good
○ We quickly ran into problems with the query engine
‣ Timing: The two main developers got hired by
DataStax (the Cassandra company)
○ KairosDB development came to a halt
KairosDB
‣ Originally written as an alternative query engine
for KairosDB
‣ We kept using the KairosDB database schema
and KairosDB metric writers
‣ June 2014: We dropped KairosDB and Heroic
became a stand-alone product
‣ ElasticSearch used for indexing time series
metadata
May 2014: The birth of Heroic
Monster couldn’t scale, but this was not obvious to
the users
‣ When it worked, it was blazing fast and beat all
other solutions
‣ When it broke, it crashed and required the
attention of the monitoring team, but most
users never knew
‣ Only visible sign: shorter and shorter history
Back to the Monster failure
‣ Failure: Because Monster was loved, and the
users weren’t experiencing the pain when it
broke, many teams resisted migrating from
Monster
‣ Result: We didn’t manage to shut down
Monster until August 2015
‣ In its last 6 weeks, Monster crashed 51(!) times
Back to the Monster failure
July 2014:
Graph-based
alerting
Alerting was becoming a problem again
‣ Scaling Riemann with the increasing number of
metrics became hard
○ We began sharding, but some groups of hosts were still too big
‣ Writing reliable sliding window rules in Riemann
was hard
○ Learning Riemann and Clojure was the most common
complaint from our users
‣ One team dropped our monitoring solution and
moved to an external vendor
‣ Simple thresholds on time series using the
same backend, data and query language
○ 3 operators: Above, Below or Missing for X time
‣ Integrated directly into our frontend
○ No code, no fancy math, just a line on a graph
Graph-based alerting
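The three operators above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Heroic's implementation: the function and enum names are invented here, and "for X time" is read as "every point in the last X seconds breaches the threshold".

```python
from enum import Enum


class Op(Enum):
    ABOVE = "above"
    BELOW = "below"
    MISSING = "missing"


def is_firing(points, op, threshold, duration, now):
    """Decide whether a simple threshold alert fires.

    points: iterable of (timestamp, value) pairs, timestamps in seconds.
    ABOVE/BELOW fire when every point in the last `duration` seconds breaches
    the threshold; MISSING fires when that window holds no points at all.
    """
    window = [(t, v) for t, v in points if now - duration <= t <= now]
    if op is Op.MISSING:
        return not window
    if not window:
        return False  # no data: leave that case to a MISSING alert
    if op is Op.ABOVE:
        return all(v > threshold for _, v in window)
    return all(v < threshold for _, v in window)


# Usage: disk usage sampled once a minute, alert if above 90% for two minutes.
disk = [(0, 0.85), (60, 0.92), (120, 0.95), (180, 0.97)]
print(is_firing(disk, Op.ABOVE, 0.9, duration=120, now=180))
```

Keeping the rule to one operator, one threshold and one duration is what lets it be drawn as a single line on the graph it alerts on.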
Graph-based alerting
‣ Our engineers love it!
○ Thousands of lines of Riemann code were ripped out
○ Many teams have migrated completely away from Riemann
○ We saw a massive speed-up in adoption of monitoring; both
data collection and definitions of dashboard and alerts
‣ Many monitoring problems can indeed be
expressed as a simple threshold
Graph-based alerting
Adoption of Heroic
‣ We are currently collecting ~10TB of metrics
per month worldwide
○ 30TB of storage in Cassandra due to replication factor
‣ ~80% of our data was collected within the last 6
months
Adoption of Heroic
The final current picture
Metrics and
events
ffwd
Alerts
Pretty
graphs
Riemann
Apache
Kafka
Heroic
‣ ffwd and ffwd-java have been developed as
Open Source software from the start
‣ Heroic was released as Open Source software
yesterday
○ Blog post: “Monitoring at Spotify: Introducing Heroic”
○ Other components are being released later
We finally Open Sourced it
What we
have learned
so far
● Learning a new monitoring
system is an investment
● Legacy systems are almost
always the hardest
But it gets worse...
● Almost all systems end up as
legacy
● You probably haven’t
installed your last
monitoring system
Migrations are hard and expensive
Suggestions:
● Consider having abstraction
layers
● Beware of vendor lock-in
○ Open Source software is not
safe
● Sometimes it’s cheaper to
keep a migration layer for
legacy systems than
migrating
● The monitoring/operations
team are experts; feature
developers might not be
● User experience matters for
adoption
● The learning curve affects
the cost of adoption for
teams
User experience and learning curve
matter
● A technically superior
solution is worthless, if your
users don’t understand it
● Providing good defaults and
an easy golden path will not
only drive adoption, but also
prevent users from making
common mistakes
When collection is easy and
performance is good, engineers
will start using the monitoring
system as a debugger.
● Storage is cheap but not
free
● The operational cost of
keeping debugging data
highly available is
significant
Beware of scope creep in monitoring
When graphing is easy, pretty
and powerful, people will start
using monitoring for business
analytics.
● Monitoring is supposed to be
reliable, but not accurate
● Seems very intuitive
● Fragile, sensitive to latency
and sporadic failures
● Noise for alerting
● What are you really
measuring?
● Solution: Convert your
problem into a metric by
interpreting close to the
source
Heartbeats are hard to get right
● We used to send events on every Puppet run
● Teams would make monitoring rules for failed Puppet runs
and absent Puppet runs
● Problem: Absent Puppet runs look exactly the same when
○ Puppet is disabled
○ Network is down
○ Host is down
○ Host has been decommissioned
● Solution: Emit “Time since last successful Puppet run” metric
instead
○ Now we can do simple thresholds, which are easy to reason about
Heartbeats example: Puppet runs
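A minimal sketch of the "time since last successful run" idea follows. How the timestamp of the last successful run is obtained on the host is deliberately left out; the function names and the two-hour limit are assumptions for illustration only.

```python
import time


def seconds_since_last_success(last_success_epoch, now=None):
    """Turn "when did Puppet last succeed?" into a plain gauge, in seconds."""
    current = time.time() if now is None else now
    # Clamp at zero so a small clock skew never yields a negative age.
    return max(0.0, current - last_success_epoch)


def puppet_is_stale(age_seconds, max_age_seconds=2 * 60 * 60):
    """A simple threshold: stale when the last success is older than the limit."""
    return age_seconds > max_age_seconds


# Usage: the host emits the age as a regular metric; the alert is a threshold.
age = seconds_since_last_success(last_success_epoch=1000.0, now=1600.0)
print(age, puppet_is_stale(age))
```

Because the value is computed on the host, a disabled or failing Puppet shows up as a steadily growing number rather than as the absence of an event.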
● Indexing 100M time series
is hard
● Browsing 100M time series
is hard
○ UI design - getting an overview
of 100M time series is hard
○ Understanding a graph with
thousands of lines is difficult for
humans
● Your data will keep growing
The next big scaling problem is very
human: data discovery
● Anomaly detection and
machine learning might
help us
○ Many new and upcoming
products look promising
○ ...but still largely an unsolved
problem
Thank you for
your time and
patience!
Martin Parm
email: parmus@spotify.com
twitter: @parmus_dk
List of Open Source software mentioned
‣ Munin, http://munin-monitoring.org/
‣ Zabbix, http://www.zabbix.com/
‣ Riemann, http://riemann.io/
‣ Apache Kafka, http://kafka.apache.org/
‣ Atlas, https://github.com/Netflix/atlas
‣ Prometheus, http://prometheus.io/
‣ OpenTSDB, http://opentsdb.net/
‣ InfluxDB, https://influxdb.com/
‣ KairosDB, http://kairosdb.github.io/
‣ ffwd, https://github.com/spotify/ffwd
‣ ffwd-java, https://github.com/spotify/ffwd-java
‣ Heroic, https://github.com/spotify/heroic
‣ Cassandra, http://cassandra.apache.org/
‣ ElasticSearch, https://www.elastic.co/


OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm

  • 1. Monitoring at Spotify - When things go ping in the night Martin Parm, Product owner for Monitoring
  • 2. ‣ Martin Parm, Danish, 36 years old ‣ Master’s degree in CS from Copenhagen University ‣ IT operations and infrastructure since 2004 ‣ Joined Spotify in 2012 (and moved to Sweden) ‣ Joined Spotify’s monitoring team in February 2014 ‣ Currently Product Owner for monitoring About Martin Parm
  • 3. This talk neither endorses nor condemns any specific products, open source or otherwise. All opinions expressed are in the context of Spotify’s specific history with monitoring. Disclaimer
  • 4. This talk is not a sales pitch for our monitoring solution. It’s a story about how it came to be. Disclaimer (2)
  • 6. ‣ Music streaming - and discovery ‣ Clients for all major operating systems ‣ Partner integration into speakers, TVs, PlayStation, ChromeCast, etc. ‣ More than 20 million paid subscribers* ‣ More than 75 million daily active users* ‣ Paid back more than $3 billion in royalties* The service - the right music for every moment * https://news.spotify.com/se/2015/06/10/20-million-reasons-to-say-thanks/
  • 7. ‣ Major tech offices in Stockholm, Gothenburg, New York and San Francisco ‣ Small sales and label relations offices in all markets (read: countries) ‣ ~1500 employees worldwide ○ ~50% in Technology ○ ~100 people in IO, our infrastructure department ○ 6 engineers in our monitoring team The people
  • 8. ‣ 4 physical data centers ‣ ~10K physical servers ‣ Microservice infrastructure with ~1000 different services ‣ Mostly Ubuntu Linux The technology
  • 9. ‣ Ops-In-Squads ○ Distributed operational responsibility in the feature teams ○ Monitoring as self-service for backend ‣ Two monitoring systems for backend ○ Heroic - time series based graphs and alerts ○ Riemann - event based alerting ‣ ~100M time series ‣ ~750 graph-based alert definitions ‣ ~1500 graph dashboards Operations and monitoring
  • 10. Our story begins back in 2006...
  • 11. Spotify starts in 2006 - a different world ‣ No cloud computing ○ AWS had only just launched ○ Google App Engine and Microsoft Azure didn’t exist yet ‣ Few cloud applications ○ Gmail was still in limited public beta ○ Google Docs was still in limited testing ○ Facebook opened for public access in September ‣ No smartphones ○ Apple had not even unveiled the iPhone yet ○ Android didn’t become available until 2008
  • 12. Minutes from meeting in June 2007 “Munin mail - It's not sustainable to get 100 status mails per day about full disks to dev. X should turn them off.”
  • 13. ‣ Sitemon: our first graphing system ○ Based on Munin but with a custom frontend ○ Metrics were pulled from hosts by aggregators ○ Metrics were written to several different RRD files with different resolutions ○ Static graphs were generated from these RRD files ○ One single dashboard for all systems and users ○ Main metrics: Backend Request Failures First steps with monitoring
  • 14. -- Emil Fredriksson, Operations director at Spotify (slightly paraphrased) “Our first alerting system was an engineer, who looked at graphs all the time; day and night; weekends. And he would start calling up people, when something looked wrong.”
  • 15. ‣ Spotify launched in Sweden and UK in October 2008 ‣ Zabbix was introduced in September 2009 ‣ Alerts were sent as text messages to Operations, who would then contact feature developers ‣ Most common “alerting source”: Users ○ Operations had a permanent Twitter search First steps with alerting
  • 17. ‣ Opened our 3rd data center and grew to ~1000 hosts ‣ Spotify grew from ~100 to 400-600 people worldwide in a few months ‣ Many new engineers didn’t have operational experience or DevOps mentality ‣ A rift between dev and ops emerged... 2011-2012: The 2nd great hiring spree
  • 18. ‣ Development speed-up and a vast increase in new services ‣ Stability and reliability were an increasing problem ‣ Service ownership was often unclear ‣ Too frequent changes for a monolithic SRE team to keep up 2011-2012: The 2nd great hiring spree
  • 19. ‣ Big incidents almost every week → The business were unhappy ‣ Constant panic and fire fighting → The SRE team were unhappy ‣ Policies and restrictions, and angry SRE engineers → The feature developers were unhappy 2011-2012: The 2nd great hiring spree
  • 20. “The infrastructure and feature squads that write services should also take responsibility for correct day-to-day operation of individual services.” ‣ Capacity Planning ‣ Service Configuration and Deployment ‣ Monitoring and Alerting ‣ Defining and Managing to SLAs ‣ Managing Incidents September 2012: Ops In Squads
  • 21. Benefits ‣ Organizational Scalability ‣ Faster incident solving - getting The Right Person™ on the problem faster ‣ Accountability - making the right people hurt ‣ Autonomy - feature teams make all their own planning and decisions September 2012: Ops In Squads
  • 22. Human challenges ‣ Developers need training, but not a new education ‣ Developers need autonomy, but will do stupid things ‣ Developers need to care about monitoring and alerting, but not the monitoring pipeline September 2012: Ops In Squads
  • 23. We needed infrastructure as services ‣ The classic Operations team was disbanded ‣ Operations engineers and tools teams were reformed in IO, our infrastructure organization ‣ Teaching and self-service became a primary priority September 2012: Ops In Squads
  • 24. “Creating IO was probably one of the smartest moves in Spotify” -- Previous Product Owner for Monitoring (slightly paraphrased)
  • 25. Three tales of failure* * Read: Learning opportunities
  • 26. ‣ Late 2011: Backend Infrastructure Team (BIT) was formed ○ BIT was the first infrastructure team at Spotify ‣ Tasked with log delivery, monitoring and alerting ‣ Development of Sitemon2 began ○ Meant to replace Sitemon ○ Still based on Munin, but with a Cassandra backend and much more powerful frontend Sitemon2 - The graphing system, which never launched
  • 27. ...but BIT was set up for failure from the start ‣ Sitemon2 was developed mostly in isolation and with very little collaboration with developers ‣ Priority collisions: Log delivery was always more critical than monitoring ‣ Scope creep: BIT tried to integrate Sitemon2 with analytics Sitemon2 - The graphing system, which never launched
  • 28. We needed feature teams to take part in monitoring, but Zabbix was too inflexible and hard to learn. ‣ Late 2012: Development of OMG began ○ Event streaming processor similar to Riemann ○ Initial development was super fast and focused ○ Developed in collaboration with Operations ‣ A few teams adopted OMG, but..... OMG - The alerting system no one understood
  • 29. OMG rule written in Esper {Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].avg(300)}>0.1&(({TRIGGER.VALUE}=0&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.9)|({TRIGGER.VALUE}=1&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.97))
  • 30. The alerting rule language was Esper (EPL) ‣ Most engineers found the learning curve way too steep and confusing ‣ Too few tools and libraries for the language Why was this not caught? ‣ The Ops engineer assigned to the collaboration happened to also like Esper OMG - The alerting system no one understood
  • 31. ‣ February 2013: One of our system architects builds Monster as a proof-of-concept hack project ○ In-memory time series database in Java ○ Based on Munin collection and data model ○ Metric data was pushed rather than pulled ○ The prototype was completed in 2 weeks Monster
  • 32. ‣ Pushing monitoring data was much more reliable than pulling ‣ Querying and graphing is blazing fast ‣ The Operations engineers loved it! ‣ Sitemon kept running, but development of Sitemon2 was halted We’ll get back to the failure part... Monster
  • 33. 2013: The birth of a dedicated monitoring team
  • 34. ‣ First dedicated monitoring team at Spotify ‣ Assigned with the task of “Providing self-service monitoring solutions for DevOps teams” ‣ Inherited Monster, Zabbix and OMG ‣ Calculation: Monster could survive a year, so we focused on alerting first Hero Squad
  • 35. ‣ Replaced Zabbix and OMG with Riemann ○ “Riemann is an event stream processor” ○ Written in Clojure ○ Rules are also written in Clojure ‣ We built a support library with helper functions, namespace support and unit testing ‣ Built a web frontend for reading the current state of Riemann Riemann as a self-service alerting system
  • 36. * Some boilerplate code * (def-rules (where (tagged-any roles) (tagged "monitoring-hooks" (:alert target)) (where (service "vfs/disk_used/") (spotify/trigger-on (p/above 0.9) (with {:description "Disk is getting full"} (:alert target)))))) Riemann rule written in Clojure
  • 37. ‣ Success: Riemann was widely adopted ‣ Success: Riemann is a true self-service system ○ Riemann rules live in a shared git repo, which gets automatically deployed ○ Each team/project has its own namespace ○ Unit tests ensure that rules work as intended ○ Peak: 36 namespaces and ~5000 lines of Clojure code ‣ Failure: Many engineers didn’t understand or like the Clojure language Riemann as a self-service alerting system
  • 38. 2014: Now for the pretty graphs...
  • 39. ‣ Sharding and rebalancing quickly became a serious operational overhead ○ The whisper write pattern involved randomly seeking and writing across a lot of files – one for each series ○ The Cyanite backend has recently addressed this ‣ Hierarchical naming ○ Example: “db1.cpu.idle” ○ Difficult to slice and dice metrics across dimensions, e.g. select all hosts in a site or all hosts running a particular piece of software A brief encounter with collectd and Graphite
  • 40. ‣ Replace long hierarchical names with tags ‣ Compatible with what Riemann does with events ‣ Makes slicing and dicing metrics easy .... but who supports it? Metric 2.0
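The difference between hierarchical names and tags can be illustrated with a minimal sketch. This is not how Heroic or ffwd model metrics internally; the `Metric` class, the `select` helper, and the tag names and values (`host`, `site`, `role`, "sto", "lon") are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """One data point: a short key plus free-form tags (hypothetical model)."""
    key: str
    value: float
    tags: dict = field(default_factory=dict)


def select(metrics, **wanted):
    """Return the metrics whose tags match every requested key/value pair."""
    return [
        m for m in metrics
        if all(m.tags.get(k) == v for k, v in wanted.items())
    ]


# With a hierarchical name such as "db1.cpu.idle", the site and role of the
# host are not part of the name at all. With tags they are just more dimensions.
metrics = [
    Metric("cpu-idle", 93.0, {"host": "db1", "site": "sto", "role": "database"}),
    Metric("cpu-idle", 71.5, {"host": "db2", "site": "lon", "role": "database"}),
    Metric("cpu-idle", 40.2, {"host": "ap1", "site": "sto", "role": "accesspoint"}),
]

hosts_in_site = [m.tags["host"] for m in select(metrics, site="sto")]
databases = [m.tags["host"] for m in select(metrics, role="database")]
```

Selecting "all hosts in a site" or "all hosts running a particular piece of software" becomes a plain filter on one tag, with no need to encode every dimension into the metric name.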
  • 41. ‣ We need quick adoption and commitment from our feature teams ‣ Monitoring was still very immature and we need room to experiment and fail ‣ Problem: Engineers get sick of migrations and refactorings ‣ Solution: A flexible infrastructure and an “API” Adoption vs. flexibility
  • 42. ‣ Small daemon running on each host, which forwards events and metrics to monitoring infrastructure ‣ First written in Ruby, but later ported to Java ‣ Provides a stable entry point for our users ffwd - a monitoring “API”
  • 43. Abstracting away the infrastructure from monitoring collection Magical monitoring pipeline Metrics and events ffwd Alerts Pretty graphs
  • 44. ‣ Atlas, developed by Netflix ○ Hadn’t been open sourced yet ‣ Prometheus, developed by SoundCloud ○ Hadn’t been open sourced yet ‣ OpenTSDB, originally developed by StumbleUpon ○ Was rejected because of bad experiences with HBase ‣ InfluxDB ○ Was too immature at the time Tripping towards graphing; it’s all about the timing
  • 45. ‣ Time series database written in Java and backed by Cassandra ‣ Looked promising at first ○ We deployed it and killed Sitemon for good ○ We quickly ran into problems with the query engine ‣ Timing: The two main developers got hired by DataStax (the Cassandra company) ○ KairosDB development came to a halt KairosDB
  • 46. ‣ Originally written as an alternative query engine for KairosDB ‣ We kept using the KairosDB database schema and KairosDB metric writers ‣ June 2014: We dropped KairosDB and Heroic became a stand-alone product ‣ ElasticSearch used for indexing time series metadata May 2014: The birth of Heroic
  • 47. Monster couldn’t scale, but this was not obvious to the users ‣ When it worked, it was blazing fast and beat all other solutions ‣ When it broke, it crashed and required the attention of the monitoring team, but most users never knew ‣ Only visible sign: shorter and shorter history Back to the Monster failure
  • 48. ‣ Failure: Because Monster was loved, and the users weren’t experiencing the pain when it broke, many teams resisted migrating from Monster ‣ Result: We didn’t manage to shut down Monster until August 2015 ‣ In its last 6 weeks, Monster crashed 51(!) times Back to the Monster failure
  • 50. Alerting was becoming a problem again ‣ Scaling Riemann with the increasing number of metrics became hard ○ We began sharding, but some groups of hosts were still too big ‣ Writing reliable sliding window rules in Riemann was hard ○ Learning Riemann and Clojure was the most common complaint from our users ‣ One team dropped our monitoring solution and moved to an external vendor
  • 51. ‣ Simple thresholds on time series using the same backend, data and query language ○ 3 operators: Above, Below or Missing for X time ‣ Integrated directly into our frontend ○ No code, no fancy math, just a line on a graph Graph-based alerting
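The three operators above can be sketched in a few lines. This is only an illustration of the idea, not Heroic's implementation: the function name, the argument layout, and the interpretation of "for X time" (every point in the trailing window must violate the threshold, or the window must be empty for Missing) are assumptions.

```python
from enum import Enum


class Operator(Enum):
    ABOVE = "above"
    BELOW = "below"
    MISSING = "missing"


def is_triggered(points, operator, duration, now, threshold=None):
    """Evaluate one threshold alert against a single time series.

    points: list of (timestamp_seconds, value) pairs.
    duration: length of the trailing window, in seconds ("for X time").
    """
    window = [v for t, v in points if now - duration <= t <= now]
    if operator is Operator.MISSING:
        # No data at all in the window.
        return not window
    if not window:
        # ABOVE/BELOW cannot be decided without data in this sketch.
        return False
    if operator is Operator.ABOVE:
        return all(v > threshold for v in window)
    return all(v < threshold for v in window)


# Example: disk usage as a fraction, sampled once a minute.
disk_used = [(0, 0.50), (60, 0.95), (120, 0.97), (180, 0.99)]
full_for_2_min = is_triggered(disk_used, Operator.ABOVE, 120, now=180, threshold=0.9)
full_for_3_min = is_triggered(disk_used, Operator.ABOVE, 180, now=180, threshold=0.9)
silent_for_1_min = is_triggered(disk_used, Operator.MISSING, 60, now=300)
```

Because the rule is just an operator, a threshold and a duration, it can be drawn as a line on the same graph the engineer is already looking at.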
  • 53. ‣ Our engineers love it! ○ Thousands of lines of Riemann code were ripped out ○ Many teams have migrated completely away from Riemann ○ We saw a massive speed-up in adoption of monitoring; both data collection and definitions of dashboards and alerts ‣ Many monitoring problems can indeed be expressed as a simple threshold Graph-based alerting
  • 55. ‣ We are currently collecting ~10TB of metrics per month worldwide ○ 30TB of storage in Cassandra due to replication factor ‣ ~80% of our data was collected within the last 6 months Adoption of Heroic
  • 56. The final current picture Metrics and events ffwd Alerts Pretty graphs Riemann Apache Kafka Heroic
  • 57. ‣ ffwd and ffwd-java have been developed as Open Source software from the start ‣ Heroic was released as Open Source software yesterday ○ Blog post: “Monitoring at Spotify: Introducing Heroic” ○ Other components are being released later We finally Open Sourced it
  • 59. ● Learning a new monitoring system is an investment ● Legacy systems are almost always the hardest But it gets worse... ● Almost all systems end up as legacy ● You probably haven’t installed your last monitoring system Migrations are hard and expensive Suggestions: ● Consider having abstraction layers ● Beware of vendor lock-in ○ Open Source software is not safe ● Sometimes it’s cheaper to keep a migration layer for legacy systems than migrating
  • 60. ● The monitoring/operations team are experts; feature developers might not be ● User experience matters for adoption ● The learning curve affects the cost of adoption for teams User experience and learning curve matter ● A technically superior solution is worthless, if your users don’t understand it ● Providing good defaults and an easy golden path will not only drive adoption, but also prevent users from making common mistakes
  • 61. When collection is easy and performance is good, engineers will start using the monitoring system as a debugger. ● Storage is cheap but not free ● The operational cost of keeping debugging data highly available is significant Beware of scope creep in monitoring When graphing is easy, pretty and powerful, people will start using monitoring for business analytics. ● Monitoring is supposed to be reliable, but not accurate
  • 62. ● Seems very intuitive ● Fragile, sensitive to latency and sporadic failures ● Noise for alerting ● What are you really measuring? ● Solution: Convert your problem into a metric by interpreting close to the source Heartbeats are hard to get right
  • 63. ● We used to send events on every Puppet run ● Teams would make monitoring rules for failed Puppet runs and absent Puppet runs ● Problem: Absent Puppet runs look exactly the same when ○ Puppet is disabled ○ Network is down ○ Host is down ○ Host has been decommissioned ● Solution: Emit “Time since last successful Puppet run” metric instead ○ Now we can do simple thresholds, which are easy to reason about Heartbeats example: Puppet runs
  • 64. ● Indexing 100M time series is hard ● Browsing 100M time series is hard ○ UI design - getting an overview of 100M time series is hard ○ Understanding a graph with thousands of lines is difficult for humans ● Your data will keep growing The next big scaling problem is very human: data discovery ● Anomaly detection and machine learning might help us ○ Many new and upcoming products look promising ○ ...but still largely an unsolved problem
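The "Time since last successful Puppet run" idea can be sketched as follows. This is a minimal illustration under stated assumptions: the metric key, the dictionary layout and the function names are hypothetical, and how the timestamp of the last successful run is obtained (and how the metric is shipped) is left out because the slides do not say.

```python
import time
from typing import Optional


def puppet_age_metric(last_success_ts: float, now: Optional[float] = None) -> dict:
    """Build a 'time since last successful Puppet run' metric (hypothetical format)."""
    now = time.time() if now is None else now
    return {
        "key": "puppet.time_since_last_success",  # assumed metric name
        "value": max(0.0, now - last_success_ts),  # age in seconds, never negative
        "unit": "s",
    }


def is_stale(metric: dict, max_age_seconds: float) -> bool:
    """A plain 'Above' threshold on the age metric."""
    return metric["value"] > max_age_seconds
```

Because the value is an ordinary number that grows while Puppet is not succeeding, a rule such as "above 1800 seconds" replaces the event-based "absent Puppet run" logic with a simple threshold.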
  • 65. Thank you for your time and patience! Martin Parm email: parmus@spotify.com twitter: @parmus_dk
  • 66. List of Open Source software mentioned ‣ Munin, http://munin-monitoring.org/ ‣ Zabbix, http://www.zabbix.com/ ‣ Riemann, http://riemann.io/ ‣ Apache Kafka, http://kafka.apache.org/ ‣ Atlas, https://github.com/Netflix/atlas ‣ Prometheus, http://prometheus.io/ ‣ OpenTSDB, http://opentsdb.net/ ‣ InfluxDB, https://influxdb.com/ ‣ KairosDB, http://kairosdb.github.io/ ‣ ffwd, https://github.com/spotify/ffwd ‣ ffwd-java, https://github.com/spotify/ffwd-java ‣ Heroic, https://github.com/spotify/heroic ‣ Cassandra, http://cassandra.apache.org/ ‣ ElasticSearch, https://www.elastic.co/