You've heard all about what microservices can do for you. You're convinced. So you build some. Reasoning about your functionality is way easier: these services are so simple! Then you get to the point where you have 35 microservices, in three data centres, and all the monitoring and alerting tactics you used for your monoliths are a complete disaster. You can't pick out the important stuff and your inbox is unusable. Something needs to change, and this talk will explain what and how.
The present and future of serverless observability - Yan Cui
As engineers, we're empowered by advancements in cloud platforms to build ever more complex systems that achieve amazing feats at a scale previously only possible for an elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon: for example, you can no longer access the underlying host to install monitoring agents/daemons, and it's no longer feasible to use background threads to send monitoring data outside the critical path.
Furthermore, event-driven architectures have become easily accessible and widely adopted by those adopting serverless technologies. This trend adds another layer of complexity to how we monitor and debug our systems, as it involves tracing executions that flow through async invocations and are often fanned out and fanned in via various event processing patterns.
Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.
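One pattern consistent with these constraints is to emit metrics as structured log lines from within the handler and have them converted into metrics asynchronously (for example via a log subscription), keeping the critical path free of network calls. The sketch below is illustrative only; the function and field names are hypothetical, not from the talk:

```python
import json
import time

def log_metric(name, value, unit="Count", **dimensions):
    """Emit a metric as one structured log line.

    An asynchronous consumer (e.g. a log subscription) can turn these
    lines into metrics without blocking the function's critical path.
    """
    record = {
        "metric_name": name,
        "value": value,
        "unit": unit,
        "timestamp": int(time.time() * 1000),
        "dimensions": dimensions,
    }
    print(json.dumps(record))  # stdout ends up in the platform's logs
    return record

def handler(event, context=None):
    # Hypothetical Lambda-style handler, instrumented without agents
    # or background threads.
    start = time.perf_counter()
    result = {"status": "ok"}  # ... real work would go here ...
    elapsed_ms = (time.perf_counter() - start) * 1000
    log_metric("invocation_latency_ms", elapsed_ms,
               unit="Milliseconds", function_name="demo")
    return result
```

Because the log write is synchronous but local, it adds microseconds to the invocation rather than a network round-trip.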
Applying principles of chaos engineering to Serverless - Yan Cui
Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.
Netflix is undoubtedly the leader in this field, but most of the publicised tools and articles focus on killing EC2 instances, and efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.
But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?
These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns in these systems.
Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?
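As a rough illustration of what controlled failure injection could look like for Lambda-style functions, here is a minimal wrapper sketch; the names and parameters are hypothetical, not a description of any particular chaos tool:

```python
import random
import time

def inject_chaos(handler, failure_rate=0.0, max_latency_ms=0):
    """Wrap a Lambda-style handler to inject latency and failures.

    In a real deployment the knobs would come from configuration
    (e.g. environment variables) so the blast radius can be dialled
    up gradually and switched off instantly.
    """
    def wrapped(event, context=None):
        if max_latency_ms:
            # Simulate degraded dependencies with random added latency.
            time.sleep(random.uniform(0, max_latency_ms) / 1000.0)
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected failure")
        return handler(event, context)
    return wrapped

def handler(event, context=None):
    return {"status": "ok"}

# failure_rate=0: behaves exactly like the original handler.
safe = inject_chaos(handler)
# failure_rate=1.0: always fails, useful for verifying fallbacks and alerts.
always_fail = inject_chaos(handler, failure_rate=1.0)
```

The wrapper keeps the experiment controlled: it is opt-in per function and trivially reversible, which matters when you cannot reach into the runtime the way you can with an EC2 instance.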
Embracing DevSecOps: A Changing Security Landscape for the US Government - DJ Schleen
As part of this change, all contractors and government software developers will need to think critically and not only ask themselves “does the code have vulnerabilities,” but “could it have vulnerabilities,” and “how do we know either way?”
Learn how, with the right tools and security embedded across the entire development process, you can stay ahead of adversaries, leaving the software supply chain secure so that mindshare can be devoted to other critical national security issues.
SHOWDOWN: Threat Stack vs. Red Hat AuditD - Threat Stack
Traditionally, people have used the userland daemon auditd, built by some good Red Hat folks, to collect and consume this data. However, there are a couple of problems with traditional open source auditd and its libraries that we've had to deal with ourselves, especially when trying to run it on performance-sensitive systems and make sense of the sometimes obtuse data that traditional auditd spits out. To that end, we've written a custom audit listener from the ground up for the Threat Stack agent (tsauditd).
Applying principles of chaos engineering to serverless (O'Reilly Software Arc... - Yan Cui
Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.
Netflix is undoubtedly the leader in this field, but most of the publicised tools and articles focus on killing EC2 instances, and efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.
But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?
These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns in these systems.
Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?
Applying principles of chaos engineering to serverless (CodeMesh) - Yan Cui
Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.
Netflix is undoubtedly the leader in this field, but most of the publicised tools and articles focus on killing EC2 instances, and efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.
But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?
These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns in these systems.
Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca... - MITRE - ATT&CKcon
Red Canary’s applied research team built the Atomic Red Team project based on a simple idea: encourage security teams to test their systems.
Leveraging MITRE ATT&CK, the series of small tests can be combined into chains to help teams gain insight into gaps in their security program at all levels. This talk describes how to use Atomic Red Team and how MITRE ATT&CK is leveraged to write the tests.
The present and future of Serverless observability (Serverless Computing London) - Yan Cui
As engineers, we're empowered by advancements in cloud platforms to build ever more complex systems that achieve amazing feats at a scale previously only possible for an elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon: for example, you can no longer access the underlying host to install monitoring agents/daemons, and it's no longer feasible to use background threads to send monitoring data outside the critical path.
Furthermore, event-driven architectures have become easily accessible and widely adopted by those adopting serverless technologies. This trend adds another layer of complexity to how we monitor and debug our systems, as it involves tracing executions that flow through async invocations and are often fanned out and fanned in via various event processing patterns.
Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.
Secure your Web Application With The New Python Audit Hooks - Nicolas Vivet
Audit hooks were added to Python 3.8 by PEP 578. This security mechanism gives you more visibility into, and control over, what your application does at runtime. After a short introduction to the new feature, we will explore ideas on how web developers, library maintainers and security engineers can leverage it to detect and block security vulnerabilities, illustrated with concrete examples.
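`sys.addaudithook` and `sys.audit` are the real PEP 578 APIs; the blocking policy below (refusing `os.system`) is just one illustrative choice of what a hook might enforce, not a policy from the talk:

```python
import sys

blocked = []  # record of denied operations, for demonstration

def audit_hook(event, args):
    # Called for every runtime audit event (imports, exec, os.system, ...).
    # Here we log and block one specific dangerous operation.
    if event == "os.system":
        blocked.append(args)
        raise RuntimeError("os.system is not allowed in this application")

sys.addaudithook(audit_hook)  # note: hooks cannot be removed once added

# Application code can raise its own custom audit events as well:
sys.audit("myapp.render_template", "index.html")
```

After the hook is installed, any call to `os.system` raises RuntimeError before the command runs, while custom events like the hypothetical `myapp.render_template` simply pass through unless the hook chooses to act on them.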
Microservices: 5 things I wish I'd known (Codemotion) - Vincent Kok
Microservices are hot! A lot of companies are experimenting with this architectural pattern, which greatly benefits the software development process. When adopting new patterns, we always encounter that moment where we think 'if only I knew this three months ago'. This talk will be a sneak peek into the world of microservices at Atlassian and reveal what we've learned about them: how to arrange, configure and build your code efficiently; deployment and testing; and how to operate effectively in this environment. You will learn five simple strategies you can apply immediately.
David Veuve, SE, Splunk, walks the audience through automated threat intelligence response, behavioral profiling, anomaly detection, and tracking an attack against the kill chain.
A look at the types of malicious artifacts from Advanced and Commodity attacks, which unique artifacts to look for, how logging caught them in a Windows environment, and how LOG-MD can help.
MalwareArchaeology.com
LOG-MD.com
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr... - MITRE - ATT&CKcon
The Vocabulary for Event Recording and Incident Sharing (VERIS) is a set of metrics designed to provide a common language for describing security incidents in a structured and repeatable manner.
VERIS has been the base analysis framework supporting the Verizon Data Breach Investigations Report (DBIR) since its inception. However, as organizations work to interpret the richness of information present in the DBIR at a more tactical level, the Threat Action Varieties available in VERIS fail to capture the detail present in the incidents recorded.
This talk presents the Verizon Common Attack Framework (VCAF), an effort from the Security Data Science team and the DBIR team in Verizon to expand and map the ATT&CK framework in alignment with the DBIR Threat Action Varieties to provide this much-sought level of granularity in recording and analyzing recorded breaches.
Additionally, this talk describes possible outcomes when this data is available and organized as such. Examples include applying DBIR-inspired attack vector analytics upon ATT&CK layer information, effectively identifying optimal control choke points on the attack graphs according to specific industries covered by the DBIR.
Dev Talk: Event Manipulation and Testing - Jason Stanley
Jason Stanley from PNC Bank talks about Zenoss event manipulation and testing for transforms, traps and triggers, starting out simple and driving towards building tools to improve testing/QA.
Chaos Engineering: Why the World Needs More Resilient Systems - C4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2luk9iS.
Tammy Butow shares her experiences using chaos engineering to build resilient systems, when they couldn’t build their systems from scratch. Filmed at qconlondon.com.
Tammy Butow is a Principal SRE at Gremlin where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Previously, she led SRE teams at Dropbox responsible for Databases and Storage systems used by over 500 million customers.
An Introduction to Prometheus (GrafanaCon 2016) - Brian Brazil
Often what you monitor and get alerted on is defined by your tools, rather than by what makes the most sense to you and your organisation. Alerts on metrics such as CPU usage are noisy and rarely spot real problems, while outages go undetected. Monitoring systems can also be challenging to maintain, and overall provide a poor return on investment.
In the past few years, several new monitoring systems have appeared that have more powerful semantics and are easier to run, offering a way to vastly improve how your organisation operates and to prepare you for a Cloud Native environment. Prometheus is one such system. This talk will look at the monitoring ideal and at how whitebox monitoring with a time series database, multi-dimensional labels and a powerful querying/alerting language can free you from midnight pages.
Prometheus is a next-generation monitoring system. It lets you see not just what your systems look like from the outside, but also gives visibility into the internals and business aspects of your systems, so everyone benefits, including both operations and developers. This talk will look at the concepts behind monitoring with Prometheus, how it's designed, why it's suitable for Cloud Native environments, and how you can get involved.
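The multi-dimensional label model can be illustrated with a toy counter. This is deliberately not the real Prometheus client library; the class and metric names are hypothetical, chosen only to show how one metric name fans out into many label-keyed time series:

```python
from collections import defaultdict

class LabeledCounter:
    """Toy illustration of Prometheus-style multi-dimensional metrics:
    one metric name, many series distinguished by their label values."""

    def __init__(self, name):
        self.name = name
        self.series = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Each unique label combination is its own series.
        key = tuple(sorted(labels.items()))
        self.series[key] += amount

    def value(self, **labels):
        return self.series[tuple(sorted(labels.items()))]

requests = LabeledCounter("http_requests_total")
requests.inc(method="GET", path="/home", code="200")
requests.inc(method="GET", path="/home", code="200")
requests.inc(method="POST", path="/login", code="500")

# A query language like PromQL can then slice across labels, e.g.
#   sum by (code) (rate(http_requests_total[5m]))
```

The point of the model is that aggregation happens at query time: you instrument once with rich labels and decide later whether to alert on error codes, paths, or the total.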
Are you ready for the next attack? Reviewing the SP Security Checklist - APNIC
Are you ready for the next attack? Reviewing the SP Security Checklist, by Barry Green.
A presentation given at the APNIC 40 Opening Ceremony and Keynotes session on Tue, 8 Sep 2015.
Are you ready for the next attack? Reviewing the SP Security Checklist (APNIC... - Barry Greene
Rethinking Security and how you can Act on Meaningful Change
What the industry recommends to protect your network is NOT working! The industry is stuck in a dysfunctional ecosystem that encourages cyber-criminal innovation at the cost of business and individual losses throughout the world. We do not need a "Manhattan Project" for the security of the Internet. What we need are tools to help operators throughout the world ask the right questions that would lead them to meaningful action. Security empowerment must empower the grassroots and provide the tools to push back on the root cause. This talk will explore these issues, highlight the dysfunction in our "security" economy, and present "take home" tools that facilitate immediate action.
Modern Web Security, Lazy but Mindful Like a Fox - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2hYU0cd.
Albert Yu presents a few viable, usable and effective defensive techniques that developers have often overlooked. Filmed at qconsf.com.
Albert Yu is currently working as a principal engineer for the Trust Engineering team at Atlassian. Over 15 years he has been exposed to many different aspects of a security program, including security engineering, R&D, product reviews, code review, penetration testing, governance and compliance, risk management, and incident response, in large-scale environments.
Akhtar Hossain: AWS San Francisco Startup Day, 9/7/17
Architecture: Manual vs Automation:
When to start automating your processes: there is a breaking point for every process where investing the time to automate it will outweigh the time spent doing it manually. Taking on these tasks too early can divert resources from where they could be better allocated. We'll look at how to determine whether and when it is the right time to automate a process. We'll cover automation in many forms, including building APIs, ChatOps, config management, and even the value of highly customized keyboard shortcuts.
Evolution of Monitoring and Prometheus (Dublin 2018) - Brian Brazil
This talk looks at the evolution of monitoring over time, the ways in which you can approach monitoring, where Prometheus fits into all this, and how Prometheus itself has grown over time.
What does "monitoring" mean? (FOSDEM 2017) - Brian Brazil
Monitoring can mean very different things to different people, and this often leads to confusion and misunderstandings. There are many offerings, both free software and commercial, and it's not always clear where each fits in the bigger picture. This talk will look a bit at the history of monitoring, and then at the general categories of metrics, logs, profiling and distributed tracing, and how each of these is important in a Cloud-based environment.
Video: https://www.youtube.com/watch?v=hCBGyLRJ1qo
The Present and Future of Serverless Observability - C4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2EU8a5z.
Yan Cui gives an overview of the challenges of observing a serverless architecture, the tradeoffs to consider, and the current state of the tooling for serverless observability, taking a look at new and upcoming tools. Filmed at qconlondon.com.
Yan Cui is a Senior Developer at Space Ape Games. He has been an architect and lead developer in a variety of industries, ranging from investment banking and e-commerce to mobile gaming. In the last 2 years he has worked extensively with serverless technologies in production, and he has been very active in sharing his experiences and the lessons he has learnt.
Penetration testing involves a lot of repetitive manual processes, including the execution of a multitude of security tools. These are traditionally run based on an analyst's judgement over the duration of a vulnerability assessment. Automating this heuristic process frees up an attacker's resources for more valuable tasks by automating the acquisition, execution and information collection steps.
A tool framework was developed over the last few months, effectively gluing over 30 unique security tools together. Each of these tools is selectively executed, dynamically, based on your target's available networked services.
The tools include a collection of open source, custom and commercial software with varying licensing requirements.
Hacking With Glue ℠
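The core dispatch idea (selecting tools based on a target's discovered services) might be sketched like this; the tool names are real open source scanners, but the mapping, function names, and matrix contents are hypothetical, not the framework's actual design:

```python
# Hypothetical service-to-tool matrix: which tools make sense to run
# against which discovered network services.
TOOL_MATRIX = {
    "http": ["nikto", "dirb"],
    "https": ["nikto", "sslscan"],
    "smb": ["enum4linux"],
    "ssh": ["ssh-audit"],
}

def select_tools(discovered_services):
    """Return the de-duplicated, ordered tool list for a target's
    open services; unknown services contribute nothing."""
    tools = []
    for service in discovered_services:
        for tool in TOOL_MATRIX.get(service, []):
            if tool not in tools:
                tools.append(tool)
    return tools

# e.g. a web server exposing HTTP and HTTPS:
print(select_tools(["http", "https"]))  # ['nikto', 'dirb', 'sslscan']
```

A real framework would then execute each selected tool against the target and collect its output, but the selection step above is what makes the execution dynamic rather than a fixed checklist.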
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl... - Brian Brazil
Often what you monitor and get alerted on is defined by your tools, rather than by what makes the most sense to you and your organisation. Alerts on metrics such as CPU usage are noisy and rarely spot real problems, while outages go undetected. Monitoring systems can also be challenging to maintain, and overall provide a poor return on investment.
In the past few years, several new monitoring systems have appeared that have more powerful semantics and are easier to run, offering a way to vastly improve how your organisation operates. Prometheus is one such system. This talk will look at the monitoring ideal and at how whitebox monitoring with a time series database, multi-dimensional labels and a powerful querying/alerting language can free you from midnight pages.
Similar to Codemotion Milan 2015 Alerts Overload (20)
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Codemotion Milan 2015 Alerts Overload
1. MILAN 20/21.11.2015
Alert overload: How to adopt a
microservices architecture without being
overwhelmed with noise
Sarah Wells - Financial Times
@sarahjwells
47. Healthchecks tell you whether a service is OK
GET http://{service}/__health
returns 200 if the service can run the healthcheck
each check will return "ok": true or "ok": false
50. Synthetic requests tell you about problems early
https://www.flickr.com/photos/jted/5448635109
82. See if you can improve it
www.workcompass.com/
83. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
Business Impact
The methode api server is slow responding to requests.
This might result in articles not getting published to the new
content platform or publishing requests timing out.
...
85. …
Technical Impact
The server is experiencing service degradation because of
network latency, high publishing load, high bandwidth
utilization, excessive memory or cpu usage on the VM. This
might result in failure to publish articles to the new content
platform.
86. Splunk Alert: PROD Content Platform Ingester Methode
Publish Failures Alert
There has been one or more publish failures to the
Universal Publishing Platform. The UUIDs are listed below.
Please see the run book for more information.
_time transaction_id uuid
Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
Two years ago, I started working on a new project at the FT, rebuilding our content platform and APIs. We're using a microservice architecture.
I'm here to talk about what it's like to move from monitoring a monolithic application to monitoring a whole lot of microservices.
Which is also about what it's like to start doing devops, because when you are building new microservices whenever you need, and throwing them away when they stop being useful, you can't do a handover to a separate operations team each time: it takes too long.
So you are going to be supporting your services, and the pain that used to be felt by operations when you didn't get monitoring and alerting right is now felt by you…
I'm guessing a lot of people in this room have been on a support mailing list at some point, so this probably looks familiar.
Too many emails, and very hard to work out what they really mean.
The bad news is ...
I saw this recently and it made me laugh.
BUT - there are lots of things I really like about microservices!
It's easy to reason about the logic within a microservice
it's easier to deploy small changes both quickly and reversibly,
it's easy to change your architecture, and once you have,
it's easy to remove the code you don't need any more, because it's all in one service and you can check that nothing is calling it via the access logs for the service…
So I don't want to go back to writing monolithic applications - but I do think that monitoring is harder for a microservice architecture.
So why is that?
Firstly, instead of 1 service, we have 45
We currently have Integration, Test and Production environments.
There's some debate about whether we need three and other teams at the FT only have production
We have at least 2 instances, for resilience, and sometimes more.
And at the moment, each of those is on its own VM
These are system checks - disk space, CPU load, NTP, DNS
Most of the checks run more often than every 5 minutes in fact
Which means you get alerts for unlikely and transient issues all the time.
Earlier this year, a new developer joined our team, and he couldn't believe the number of alert emails we were getting. He started counting.
And that's on average.
When shared infrastructure goes wrong - for example, if system time isn't being properly synchronised, or someone accidentally switches off a DNS server - and you're monitoring it from every server, EVERYTHING lights up
As an example, we use puppet to automate server setup and deployment - and we had 20000 alert emails overnight for a PLANNED failover of our puppet master
But it's not just system monitoring that is painful...
We started out creating alerts and monitoring a lot like we did for monolithic applications:
alerts based on response time
alerts for ERROR logs or responses that are server error status codes - 500s for example
First off, where in a monolith you were calling a function, now you're making an http request, which means there are more things that can go wrong
If one thing fails...
You'll get an alert from the service using it...
But if you're naive in the way you set up alerts, you'll also get an alert from anything calling THAT service
Getting alerts from multiple services can also make it difficult to find the cause
And when things DO go wrong...
This is what it feels like
…
You need to be able to support your system, which means you need to sort out your monitoring and alerting.
...
It was clear this was causing us problems, especially when we looked at the numbers : with the system and functional monitoring alerts added together, that's one every 5 minutes
so with the support of our Product Owner, we took some time to work on this.
We have a thing we do at the FT called a Quickstart - we take a small team, maybe from several different projects or skillsets, and we put them in a room together
No specific requirements, no backlog - just a topic of interest.
From feedback, it's apparently very important that free coffee and biscuits get delivered twice a day…
In this case - we focussed on alerts and how to make them more useful and rescue our email inboxes
(There's more details on this on our Technology blog, the Engine room)
As a result of this I can tell you about three principles that helped us to reduce the number of alerts and spend less time responding to false alarms and confusing information
We got some things right, and I'll cover those later
What we got wrong is that we created far too many alerts without thinking about why we were doing it… it was just another thing on the checklist - create an alert.
The problem is, you probably don't care about these alerts.
I mean, how much do you care about NTP issues in non-production environments?
But more importantly, you don't care about response times or errors where a service is just passing on what it got from lower down the stack
27. It's the business functionality you care about
Not the individual microservice.
For example, we are responsible for publishing FastFT posts - if that widget on the right on our site home page stops getting the latest updates, we will hear about it
So that's what our alerts should be focussing on
So to tell you what's important to us, I need to tell you a bit about our system...
This is a logical view of the Universal Publishing Platform
multiple source content management systems, sending us articles, blogs, images, videos etc
when content is published, it's transformed into a common format
and annotated using a concept extraction pipeline
we also have metadata taxonomies like organisations, people, memberships, all loaded in
then there are APIs to get content and metadata about content
articles about Apple -> Information about Apple -> Information about Tim Cook -> Other companies he's involved with, etc. etc. etc
Architecturally, we have a mix of Go and Java/Dropwizard apps. We use Kafka to send messages about events. We have GraphDB and Mongo data stores.
So what is our key business functionality?
1. Publishing and transforming content
2. Annotating that content - i.e. working out which companies an article mentions, or what person it's about
3. Loading updates of our data about organisations, people, etc
4. Making all that information available via APIs
But it's not the same things we care about for each...
We want to know about every failure, because each failure is a story that our customers can't read yet
Our alert should make it clear we've failed to publish something, AND what needs to be done to fix it
For publication, there aren't that many events a day - maybe 600. We can look at individual events.
For our APIs, we have 2.8 million requests a day at the moment, a little over 30 a second.
So we look at 95th and 99th percentile response time, for example, to make sure they're ok.
It doesn't have to be super fast, but it definitely can't be super slow
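To make the percentile idea concrete, here's a minimal Go sketch using the nearest-rank method, with made-up latencies. Real monitoring systems use histograms or reservoir sampling rather than sorting everything, but this shows why a p95 surfaces the slow tail that an average hides.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile of latencies (in ms) using
// nearest-rank: sort, then take the value at rank ceil(p/100 * n).
func percentile(latencies []float64, p float64) float64 {
	sorted := append([]float64(nil), latencies...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// Hypothetical response times for 10 requests, in milliseconds:
	// nine fast ones and one very slow outlier.
	times := []float64{12, 15, 11, 14, 13, 18, 16, 250, 17, 19}
	fmt.Println(percentile(times, 95)) // the outlier dominates the tail
	fmt.Println(percentile(times, 50)) // the median looks perfectly healthy
}
```

The median says everything is fine; the p95 tells you one in twenty customers is having a terrible time.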
But we don't JUST care about speed...
i.e. did something go wrong.
The obvious thing to look for is server errors - something has gone wrong somewhere in our stack.
The graph here shows when some of our blades failed in a data centre. This is for some business functionality that's not critical at the moment, so we are comfortable with all the nodes being in the same data centre, in case you're wondering why a blade failure would break things!
The sudden increase in 500 errors triggered our alerts so we knew about this really quickly.
However, we also look for client errors
a sudden increase in 400 errors, i.e. bad requests, could be your fault. We've made changes that turned out to break our API contract - e.g. POST requests suddenly needed to have Content-Type header application/json. Meets http spec, but is less lenient, and so BAD. We would want an alert for that.
We have built in back off and retry for recoverable errors
Sometimes the first request fails, and the second one succeeds. We don't want an alert in that case.
We might want a report, so we know we have a flaky connection. Or we might just accept that our network is evil.
Otherwise, it's just noise
Your alerts should be something you don't mind being interrupted about
You can go look at it whenever you want.
I bet you won't look at it as often as you think you will
We got rid of:
all our publish microservice-specific response time alerts
all our microservice-specific error alerts
and made the most interesting ones into real-time dashboards
Now your alerts really mean you need to react, make them unmissable.
This means they need to attract the attention of the people that need to react. How you do that depends on your team and your working practices
We have an 'Ops Cop', and take it in turns to do that role for a week. The ops cop will also take on small pieces of work, tidying up, refactoring - things that don't need you to be in flow (because you WILL get interrupted)
Anyone reading the alert should be able to work out:
what it actually means
the action they need to take
who to talk to if they get stuck
Use clear language and don't be vague.
Add a link to explanatory information (panic guide) - this needs to be clear too, and needs to be reviewed by someone who may have to use it but didn't write the service (e.g. new team members who've never had to look at this service/operations)
Consider how to make "future you's" life easier:
here's a search link to show you the whole transaction
here's a jenkins job to republish
Our transaction IDs are added to logs using MDC (Mapped Diagnostic Context)
Every microservice we write needs to check for a special X-Request-Id header (we do this via a Servlet Filter) and then add it to the thread context. Any requests over http must pass on the X-Request-Id header too.
This means all logs for a particular user request will have a unique identifier logged and we can look at everything that happened when an article was published or a read request was made
We have an FT standard for healthchecks - you must return a particular json response on a particular endpoint.
You return 200 for unhealthy as well - there was some debate about this; the logic is that a 500 indicates that the healthcheck can't be run, which is different from it failing
You have to look at each check to work out whether you have any failing checks
This is what the json looks like
There's a chrome plugin to make it look nicer for humans
You want to know about problems before they affect your customers, if possible.
We started off with synthetic publication requests.
Synthetic publication takes a known, old article, and publishes it every minute.
If this breaks, we can fix it before a single real publish fails.
By basic, I mean standard
A puppet based framework
goal: for developers to reliably build & deploy services* from "zero-to-customer" in less than 15mins.
... across data centres, with monitoring...
supports multiple IaaS providers
digression:
some debate about FT Platform internally, some teams aren't using it: heroku or 'naked' AWS
personal opinion: bootstrapped this type of deployment at the FT, and at the time most developers weren't that familiar with the underlying tools, but if you are already familiar with heroku and AWS, it can feel like you're being restricted
we're now evolving FT platform to reflect that, with a move to CloudFormation and an internal tool called Konstructor that provides an API wrapper round a lot of our other tools
however:
gave us monitoring and log aggregation for any new microservice with no additional effort
nagios monitors system metrics, network protocols, applications, services, servers, and network infrastructure
alerts via email or (god forbid) SMS when there are failures and when the service recovers
you can acknowledge alerts to stop the notifications
put into maintenance mode for known downtimes
Every VM set up using FT Platform automatically forwards logs to Splunk.
Any query you want to run across all hosts in a service, or across all services that take part in a particular event, is easy to do without having to jump onto the relevant box
We use it to
identify problems and alert
visualise performance or load
create dashboards for particular services
But more recently, we're moving away from Splunk dashboards..
And instead we're graphing our metrics using Graphite and Grafana.
We're using Dropwizard for our Java apps and that comes with codahale metrics embedded. It's a small config change to write those metrics to a graphite server...
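That config change looks roughly like this in a Dropwizard YAML file - treat the exact keys as an assumption, since they vary between Dropwizard and metrics-library versions, and the host name here is a placeholder:

```yaml
# Sketch of wiring codahale metrics to a Graphite server via config.
metrics:
  frequency: 1 minute
  reporters:
    - type: graphite
      host: graphite.example.internal   # hypothetical Graphite host
      port: 2003
      prefix: content-platform.methode-api
```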
Graphite isn't particularly pretty - you can see all the metrics and compose graphs on the fly...
... but by using Grafana on top of it you can easily create beautiful custom dashboards...
They're quick to load as well
This shows one of our Read API components, so we're interested in server errors, client errors, successful requests
And also request rate across hosts.
Interesting - here, the traffic started to switch over from one data centre to another, I have no idea why!
We were using Splunk to pick up ERROR level logs
The problem there is that every ERROR results in an alert.
You might be more hardcore about this than me, but unless you have zero tolerance of ERROR logs, there will be times when there are some errors that aren't a priority - they don't represent a major issue and there aren't that many
We got some of those from the client we use to talk to Kafka
We were ignoring them and missed a problem someone introduced that also caused ERROR logs.
That wouldn't happen in Sentry or equivalent tools, because each new error TYPE results in an alert.
Again, sending logs from a Dropwizard app to Sentry is a simple configuration change to send them out to the Sentry API
OK, so that's the basic tools...
If the basic tools aren't giving you what you need, build your own.
This is easier if those basic tools have good APIs - because you can create your own view easily
Our first 'extra' tool was created by one of our integration engineers - he turned up with it one day…
SAWS
Built using Blinky tape - a programmable LED strip
Each section represents a different part of our system
Things light up when there's a problem, and when there isn't a problem, the blue lights swoosh back and forth so you know the monitoring is still running.
It used to be really cool and run on a Raspberry Pi - it's a Python script - but that broke and now it runs on an old Windows box under someone's desk.
So why did Silvano create this?
First off, frustrations with the number of emails...
Which he was sending straight to the bin...
And secondly, frustration with monitoring screens
He wanted something that was easy to instantly see if there was a problem
This is SAWS up in our office.
It's pretty simple - red indicates something bad has happened.
and he also changed from green to blue after this to make sure everyone can see if there's a problem…
It's not really this bright :)
So that was our first tool. Our second tool addresses the problem of waiting for screens to cycle through to see the one you want to see - by providing a single screen that can tell you what you need to know...
Dashing is a Sinatra based framework that lets you build beautiful dashboards.
Originally built by Shopify for showing things on monitors around the office
Adopted by the FT - lots of things we care about are very easy to add as tiles:
nagios (monitoring)
jenkins (build and deployment)
pingdom (website monitoring)
And it's not hard to add a new widget to integrate another system.
This is the customised dashboard for our system. We have tiles for our nagios monitors, and for particular jenkins jobs - the ones with the dial
Dashing is a Sinatra based framework that lets you build beautiful dashboards.
Originally built by Shopify for showing things on monitors around the office
And this is the FT's dashboard of everything
We have a duty ops team who are first line support. They use Dashing very heavily and they'll ask for things not surfaced on Dashing to be added
These tiles are arranged by service level, so the most critical systems are top right, with a platinum border
Bottom right have a bronze border - these are much more a case of 'best endeavours'
We have dashing screens up in our area now - it's enough to let you know there's an issue, and it can give a bit more granularity than the big flashing light thing
Nagios chart gives us the last 24 hours history for each Nagios monitor.
Means if we have intermittent errors that happen a lot, we don't miss them. And if something big happens when we're not there, we still know about it
So how does it work?
It screenscrapes Nagios for status - this is what that information looks like on nagios.
Nagios chart pings this regularly and keeps the information in memory for 24 hours (we go back that far as it lets us see what happened overnight, plus that was the limit before having to store it somewhere other than memory)
Each line is a service - in this case, it's all the services in Production on AWS for one of our teams
The name of the service, and of each check that failed, are shown on the left.
The bars on the right show the status at any point.
All failures are 'soft' failures - i.e. we don't wait for 3 failures to happen before indicating there was a problem. This allows us to see intermittent issues (but probably results in some noise)
YELLOW: WARNING status - a minor failure - e.g. a check took slightly over the max time to respond
RED: CRITICAL status - a major failure, i.e. no response for a check
BLUE: ACKED state
So here you can see a large data load happening that put strain onto all our servers - they were in a flapping state for hours. At some point, people started acknowledging the alerts
This one is worse. We had major problems in our Test environment - our graph database fell over. everything that had anything to do with graphs pretty much went down.
As it's Test, there was less acknowledging going on
Here's two major problems, one after the other - the pink vertical lines show when nagios chart couldn't connect to nagios, this was down to packet loss on our network.
The red bars were a firewall upgrade, eventually rolled back. Again, this is Test.
Nagios chart works because it uses the human ability to make sense of patterns - we generally know when things are going wrong just out of the corner of our eye
If viewed on your browser, pixel mapping takes you RIGHT to the error in nagios
It's been successful - individual teams picked it up and it's been adopted by our platform and environments team, to make it available more generally at the FT.
If it sounds interesting, let me know - it's not open sourced yet.
…
So the final comment on tools is about the tools you use for communication...
That's probably a bit harsh..
But it's certainly not email for me.
Even if you get the numbers down to a manageable level, threaded view isn't good for alerts - and it's hard to work out what they mean from this view (I realised after I took this screenshot that these aren't even alerts for my system - another team copied config and sent us all their alerts for a while)
And we are moving away from email for team communication at the FT…
We're using Slack a lot - most people have a Slack client open.
Slack has great integration tools
webhooks let you call an http endpoint and post a message
email integration fits well with existing tools - anything that can send an email can send a Slack message
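The webhook integration really is just an http POST with a JSON body. A minimal Go sketch - the webhook URL is a placeholder, since each Slack team generates its own:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildPayload wraps the alert text in the JSON shape Slack's incoming
// webhooks expect: {"text": "..."}.
func buildPayload(text string) ([]byte, error) {
	return json.Marshal(map[string]string{"text": text})
}

// postAlert sends an alert to a Slack incoming webhook URL.
func postAlert(webhookURL, text string) error {
	body, err := buildPayload(text)
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("slack returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	payload, _ := buildPayload("PROD: publish failures - see run book")
	fmt.Println(string(payload))
	// In real use: postAlert("https://hooks.slack.com/services/...", "...")
}
```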
One of my colleagues tried to persuade me to set up a separate channel for our alerts, not using the main team channel.
I think that's effectively saying "Put it somewhere where I can ignore it"
If you are getting so many of these alerts that it's annoying, there are two things you can do:
tune the alert (e.g. for API requests, we require an increased number of failures in a ten-minute period, so we tend to get this alert for real issues, not network blips)
fix your broken system
One other thing I'm trying to persuade people to do is use Slack reactions to show that you've picked up an alert, and fixed it
I read that editorial teams are using Slack like this to move content through a workflow.
We tend to react with a tick where we fixed something, with 'eyes' if we're looking into it still
But the problem I have is the creativity of developers - I have to ask people what they mean by a dancing lady...
If you put screens up that are clear in what they are showing - you'll notice when things go wrong
Non-developers on the team will also notice and tell you something's started flashing
Don't loop between screens - put something up that tells you what you need to know. Have more than one screen!
You have to keep a focus on them or they start to get untidy
Did you do something as a result of getting it? If no, delete it
Language should be clear - avoid jargon
Get rid of typos
Link to useful documentation
Get your newest developer to read it
Get someone from another team to read it
This is text for an email alert based on looking at access log response times
First of all - what a developer-y title for the alert: no spaces, categorised by how often it runs rather than what it means
Next up - it MIGHT result in articles not getting published.
I want to know if it DID result in articles not getting published. Also - the business doesn't care about my Methode API microservice (which is a microservice wrapping calls over CORBA so that most people don't have to deal with CORBA)
But our alerts also have a Technical impact section...
I have no idea why we decided these and only these were the reasons for slow response times. It doesn't help me work out which of these is currently the issue.
First of all - spaces in the title!
This is better - at least I can tell that it's a publish failure, from our Methode CMS.
And I can see which articles failed.
And I can go and look at a run book for more information - in fact, the run book links to somewhere (actually a Jenkins job) where you can enter the list of UUIDs and kick off a republish process. (yes, could be automated, but sometimes you want to check it's not going to fail the second time, e.g. editors use their systems in ways we didn't predict)
All of which make it much less annoying to have to deal with an alert.
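The republish step that run book points at might look something like this sketch. The `publish_fn` callback and the try-once-then-report behaviour are assumptions for illustration, not the FT's actual Jenkins job, but they capture why a human kicks it off: you want to see which articles still fail rather than blindly retrying forever.

```python
def republish(uuids, publish_fn):
    """Try to republish each article once; return the UUIDs that still
    fail so a human can investigate (e.g. an editor used the system in
    a way we didn't predict) instead of retrying in a loop."""
    still_failing = []
    for uuid in uuids:
        try:
            publish_fn(uuid)
        except Exception:
            still_failing.append(uuid)
    return still_failing
```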
This alert goes to some people in our editorial department, so they can check status and republish
So whenever you get an alert, really look at it
If someone had to come and tell you your system is broken, you probably need to find a way to know about it first next time
Although… for some things, a Slack channel that people know about is pretty good
Maybe you need to create a synthetic request, or add the right logs and create a Splunk alert
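A synthetic check can be very small. This is a sketch, not the FT's code: the URL is hypothetical, and the `fetch` parameter is injectable purely so the pass/fail logic can be tested without a real network (in production it would just be `urlopen` or your HTTP client of choice).

```python
from urllib.request import urlopen
from urllib.error import URLError

def synthetic_check(url, fetch=urlopen):
    """Return True if the endpoint answers HTTP 200, False otherwise.

    Run this on a schedule and alert when it returns False."""
    try:
        with fetch(url) as resp:
            return resp.status == 200
    except URLError:
        return False
```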
Here - something that picked up when the percentage of failures increased told us we had a problem
We've had a case where the integration that tells us when an article is published broke
Our monitoring starts from that notification
We found out via manual testing 3 days later
We asked the CMS team to add their own monitoring - but we also added a brute force test ourselves - "did we see any blog publishes in the last day?"
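That brute-force test reduces to a one-liner over publish timestamps, wherever you source them from (logs, a database, a metrics store). A minimal sketch, with the function name assumed:

```python
from datetime import datetime, timedelta

def saw_publishes_in_last_day(publish_times, now=None):
    """Brute-force sanity check: did at least one publish happen in the
    last 24 hours? If not, something upstream may be silently broken."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=1)
    return any(t >= cutoff for t in publish_times)
```

The point isn't sophistication: an empty day of publishes is the signal that the notification integration itself has broken, which no per-publish alert will ever tell you.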
I managed to turn off our publication failure alerts because I "improved" some logging
We worked this out when part of our data centre went down but we didn't see these alerts firing
If your log entry is the basis for an alert, add a unit test that fails if the entry is changed, and have the test explicitly state what the impact would be
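One way to pin that down, sketched here under the assumption that the alert matches on a literal token in the log line (the message format and names are hypothetical):

```python
# Hypothetical: a Splunk alert searches for the literal token
# "PUBLISH_FAILED", so the log format is pinned as a constant.
PUBLISH_FAILURE_MESSAGE = "PUBLISH_FAILED uuid={uuid} reason={reason}"

def format_publish_failure(uuid, reason):
    return PUBLISH_FAILURE_MESSAGE.format(uuid=uuid, reason=reason)

def test_alert_token_is_present():
    # If someone "improves" this message and drops the token, the alert
    # silently stops firing and we stop hearing about failed publishes.
    # That is the impact: do not change the token without also changing
    # the alert.
    line = format_publish_failure("abc-123", "timeout")
    assert "PUBLISH_FAILED" in line
    assert "uuid=abc-123" in line
```

The comment inside the test is doing real work: whoever breaks it learns immediately why the format matters, instead of discovering it three days later via manual testing.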
Maybe you take down one of your systems and check that you can tell the impact from the alerts you get
Maybe do this blind and see how quickly the Ops Cop can work out what's broken (we haven't done this but I'd like to)
Or you can take part in company exercises - the FT took down one of our data centres earlier this year.
We'd built up to it with smaller tests, did it on an agreed date, and made sure the right people were available. Crucially, every issue we found was worked through.
We turned off a different data centre last weekend.
For us as developers, we started thinking a few weeks beforehand about what might happen.
We KNEW we didn't have resilience for one part of our system as part of a phased approach to delivery.
However, when we started to think about what was going to happen, we found several unexpected reasons why we weren't going to have a working system (bad configuration, mostly) - we had those fixed before the day.
Netflix have their Chaos Monkey for testing resilience by randomly killing instances and services (in fact they have an entire Simian Army to test resilience at different levels)
The FT has its own Chaos Snail. If you're wondering why it's called that: it's smaller-scale than the Chaos Monkey, and it's written in shell
This runs on a virtual machine, kills processes as root, and records its work. It's a good way to see if your alerts are working.
Your monitoring needs to be at least as available as the system it's monitoring
This is something that took us a while to really get to grips with. But if the monitoring system is down, you have no idea what the state of your system is.
…
So that's it from me in terms of advice, so I guess the question is...
Zero emails from Nagios - we have our inbox back!
We rely on our other tools
We can't miss them. They are genuine alerts
So we can see how we're doing on response times and error rates at any point
There are lots of good reasons to do that
But realise what it means to support them
Think about it from the start
Make sure you have the right tools
Continue to cultivate your alerts
Our company page on Stack Overflow describes our technologies and the culture of our Technology department
We also have a technology blog where we talk about some of the things we're trying out
We have lots of our code on GitHub and are doing this more and more