SlideShare a Scribd company logo
1 of 27
Download to read offline
Insights
The Definitive Talk
Introduction
TAM Middleware -> SRT Retail -> SRT Platform / TES
Pretends to be British, but 日本語で話そう
Introduction TES
Monitoring & Alerting Thruk, Nagios, Opsgenie & Grafana
Logging ELK Stack + lot’s of Glue
Metrics Diamond, Statsd, Graphite
Dashboarding Grafana, Kibana, Lookingglass
All the other stuff
The Platform
Q&A
Black & White-box
Monitoring
Functional & Technical
Monitoring
Metric Log
[
1495735876,
sys.cpu.idle_percent,
90
]
{
"line": "29",
"class": "org.eclipse.jetty.EchoFormServlet",
"source_host": "localhost",
"thread_name": "qtp513694835-14",
"message": "Got request from 0:0:0:0:0:0:0",
"@timestamp": "1495735876",
"level": "INFO"
}
3,6 TB, 4 Machines 63 TB, 18 Machines
Healthy
Selfdiagnose page - Oh My..
Service Level Indicator
Service Level Objective
Bol.com Context
Context
- Knowledge
- ... inner working of micro services
- ... broad picture of the landscape
- … specialised in logistics, finance, sales, etc.
Context
- Responsible for
- ... a few micro services
- ... an entire Space
- ... all bol.com network / compute / storage infrastructure
Continuity
- Standby shifts spread over SRT and scrum team
- Teams are dynamic, knowledge is always not retained
Alerting Gone Wrong
Monitoring Check Count
Scrum team (Team1b): 714
Engineer on Duty Middleware: 29.770
Multiply by N amount of environments
1 day’s worth of alerts
Creating Alerts
Best Practices
Symptoms are great SLI’s
Source: Google Site Reliability Engineering
Golden Signals
Source: Google Site Reliability Engineering
Latency
- Time it takes to service a request. Differentiating between failed and
successful requests
Traffic
- How much demand is placed on your service
Errors
- Rate of Failures
Saturation
- How “full” is your service
Creating Alerts - Best Practices
Great Alerts Are:
- Simple
- Urgent
- Actionable
- Require human intervention
Life after the Alert
Life After the Alert - Some advice
Meta:
- Mitigate first
- Strict time period before escalation
Specifics:
- Dashboard with SLI’s
- Link to other dashboards for drill down
- Logs for the audit trail
- Use $tool for more
Future of Insights
at bol.com
Future of Monitoring at bol.com
- Separation of Alerting and Monitoring
- Improve Self Service Monitoring
- Focus on Metric Based Alerting
- Distributed Tracing
Q&A

More Related Content

Similar to Insights: the definitive talk - William Leese

2013 - SVCC - Intuit continuous performance testing
2013 - SVCC - Intuit continuous performance testing2013 - SVCC - Intuit continuous performance testing
2013 - SVCC - Intuit continuous performance testing
Thirugnanam Subbiah
 

Similar to Insights: the definitive talk - William Leese (20)

Monitoring microservices platform
Monitoring microservices platformMonitoring microservices platform
Monitoring microservices platform
 
Network Automation with Salt and NAPALM: a self-resilient network
Network Automation with Salt and NAPALM: a self-resilient networkNetwork Automation with Salt and NAPALM: a self-resilient network
Network Automation with Salt and NAPALM: a self-resilient network
 
Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetup
 
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
제3회난공불락 오픈소스 인프라세미나 - MySQL Performance
 
Exactpro: Non-functional testing approach
Exactpro: Non-functional testing approachExactpro: Non-functional testing approach
Exactpro: Non-functional testing approach
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
SolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds Scalability for the Enterprise
SolarWinds Scalability for the Enterprise
 
Taking Splunk to the Next Level - Manager
Taking Splunk to the Next Level - ManagerTaking Splunk to the Next Level - Manager
Taking Splunk to the Next Level - Manager
 
Splunk App for Stream
Splunk App for StreamSplunk App for Stream
Splunk App for Stream
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Splunk App for Stream for Enhanced Operational Intelligence from Wire Data
Splunk App for Stream for Enhanced Operational Intelligence from Wire DataSplunk App for Stream for Enhanced Operational Intelligence from Wire Data
Splunk App for Stream for Enhanced Operational Intelligence from Wire Data
 
Troubleshooting Skype for Business
Troubleshooting Skype for BusinessTroubleshooting Skype for Business
Troubleshooting Skype for Business
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
 
Getting Started with Splunk Enterprise
Getting Started with Splunk EnterpriseGetting Started with Splunk Enterprise
Getting Started with Splunk Enterprise
 
Walking Through Cloud Serving at Yahoo!
Walking Through Cloud Serving at Yahoo!Walking Through Cloud Serving at Yahoo!
Walking Through Cloud Serving at Yahoo!
 
Performance Optimization in Large Systems - Cusec 2019
Performance Optimization in Large Systems - Cusec 2019Performance Optimization in Large Systems - Cusec 2019
Performance Optimization in Large Systems - Cusec 2019
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
 
2013 - SVCC - Intuit continuous performance testing
2013 - SVCC - Intuit continuous performance testing2013 - SVCC - Intuit continuous performance testing
2013 - SVCC - Intuit continuous performance testing
 

More from Bol.com Techlab

More from Bol.com Techlab (20)

The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 
Test long and prosper
Test long and prosperTest long and prosper
Test long and prosper
 
The Reactive Rollercoaster
The Reactive RollercoasterThe Reactive Rollercoaster
The Reactive Rollercoaster
 
Best painkiller for Java headache
Best painkiller for Java headacheBest painkiller for Java headache
Best painkiller for Java headache
 
Organizing a conference in 80 days
Organizing a conference in 80 daysOrganizing a conference in 80 days
Organizing a conference in 80 days
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jams
 
Understanding Operating Systems by breaking them
Understanding Operating Systems by breaking themUnderstanding Operating Systems by breaking them
Understanding Operating Systems by breaking them
 
How to train your dragon
How to train your dragonHow to train your dragon
How to train your dragon
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 
Software for drafting a cold beer
Software for drafting a cold beerSoftware for drafting a cold beer
Software for drafting a cold beer
 
Going to the cloud: Forget EVERYTHING you know!
Going to the cloud: Forget EVERYTHING you know!Going to the cloud: Forget EVERYTHING you know!
Going to the cloud: Forget EVERYTHING you know!
 
How to create your presentation in an iterative way
How to create your presentation in an iterative wayHow to create your presentation in an iterative way
How to create your presentation in an iterative way
 
Wax on, wax off
Wax on, wax offWax on, wax off
Wax on, wax off
 
Jupyter and Pandas to the rescue!
Jupyter and Pandas to the rescue!Jupyter and Pandas to the rescue!
Jupyter and Pandas to the rescue!
 
How the best of Design and Development come together
How the best of Design and Development come togetherHow the best of Design and Development come together
How the best of Design and Development come together
 
The addition to your team you never knew you needed
The addition to your team you never knew you neededThe addition to your team you never knew you needed
The addition to your team you never knew you needed
 
Gravitational waves: A new era in astronomy
Gravitational waves: A new era in astronomyGravitational waves: A new era in astronomy
Gravitational waves: A new era in astronomy
 
Consumer Driven Contract Testing
Consumer Driven Contract TestingConsumer Driven Contract Testing
Consumer Driven Contract Testing
 
I want to go fast! - Exposing performance bottlenecks
I want to go fast! - Exposing performance bottlenecksI want to go fast! - Exposing performance bottlenecks
I want to go fast! - Exposing performance bottlenecks
 
Kubernetes: love at first sight?
Kubernetes: love at first sight?Kubernetes: love at first sight?
Kubernetes: love at first sight?
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Insights: the definitive talk - William Leese