Have you ever felt you took every wrong turn possible in the process of mitigating a production incident? Did you go through a 3-hour hell during incident response and felt the incident wasn’t complex enough to justify the horrors you’ve experienced? Did it cause you to question your engineering or problem-solving skills?
Well, it’s only partially you. Our brain is wired to make decision-making simpler. In doing so, it exposes itself to biases, heuristics, and other quirks that may seem like “bad decisions” in hindsight.
In this talk, through real-life outages, we’ll project those psychological principles onto the world of production monitoring and incident management. As a responder, you’ll learn why those behavioral patterns emerge during production incidents and what can be done to limit their effect; as a manager, you’ll learn how to enable and encourage a healthy environment that better handles those patterns.
2. About me: Boris Cherkasky
➔ Backend engineer and production advocate @Riskified
➔ I 🤍 Observability
➔ Scuba diver
@cherkaskyb on Twitter / LinkedIn / Medium
3. Agenda
01 The psychology of an incident response
02 Intro to cognitive biases and heuristics
03 Biases in production
4. Riskified by the numbers (as of August 2021)
650+ Global team, nearly 50% in R&D
180+ Countries across the globe
$60B+ Online volume reviewed in 2020
50+ Publicly held companies among our clients
98%+ Client retention for the past 2 years
23. Mitigating the Curse of Knowledge
Alerts and metrics should be set by “the common” responder, mentored by the expert.
When complex alerts can’t be avoided: document, explain, train, level UP your organization.
29. Mitigating the Simulation heuristic
Set the responder on the correct path as soon as possible, with minimal friction.
Minimize the time to start triage.
33. Mitigating the confirmation bias
Show simple and standardized data.
35. Mitigating the confirmation bias
● Don’t work alone
● Draw a concrete line between the observed facts, your hypothesis, and the existing state (outcome/outage)
Hey everyone!
I wanna take you back to one of the most memorable evenings I’ve had as an on-call engineer. It was a Tuesday. I’m at home, watching some movie with a warm slice of pizza in my hand, when I get an alert from our production system.
That alert turned a warm Tuesday night into a four-hour nightmare.
At 1 AM, after everything is back to normal, I look back at the incident, and my only thought is: this could have been solved in 20 minutes.
My greatest lesson from that Tuesday was that even the most experienced responder can make wrong decisions.
Those wrong decisions, and what causes them, are the topic of this talk.
So thank you for having me,
I’m Boris Cherkasky, and I’ve been breaking, fixing, and monitoring production systems for the last four years.
I’m a backend engineer and production advocate at Riskified, and I’m generally fascinated by observability.
I love scuba diving, and I write a small tech blog; you can find it, and me, under the handle cherkaskyb on most social networks and on Medium.
This talk is a journey into how our minds work, and how it “plays tricks” on us during production incidents.
We’ll start with some anatomy of the incident response process,
Then we’ll do a short introduction to cognitive biases and heuristics,
And the majority of this talk will focus on real-life incident examples. Through them we’ll cover how cognitive biases manifest in production incidents, and how we can mitigate them effectively.
A few words about Riskified. Riskified enables top brands to fulfill their maximal e-commerce potential by leveraging AI for fraud prevention and other financial-funnel optimizations.
About 60 billion dollars’ worth of orders from all around the world go through our systems.
Let’s get back to the previously mentioned Tuesday.
The incident started with total uncertainty - I got an alert from one of our most reliable systems, one that malfunctioned only once in the last four years.
I later learned that this uncertainty is exactly what allows psychological biases to thrive and affect our decision making.
Let’s draw a hypothetical chart of certainty over time during my incident, and I’ll walk you through the decisions made in those four hours.
I first decided to go to a dashboard and not to the logs. That’s a decision, and it happens to be a good one: I see some irregularities! My certainty grows. I then get a message from a colleague who reports partial information that contradicts what I’m seeing. Our certainty plummets.
We scratch our heads for a minute and open a more specific dashboard! It puts us on the path that one of our dependencies is impaired. Our certainty grows again.
We think we know the root cause, and we DECIDE to restart an instance. It doesn’t help.
Our certainty plummets once again.
And this goes on until the incident is resolved four hours later: a ping-pong game between certainty and uncertainty.
Because of this uncertainty, at each decision point we are prone to cognitive biases that are supposed to ease our decision-making process but, in fact, might cause us to make the wrong decisions.
Let’s begin our journey by defining what heuristics and biases are.
Since I’m going to be talking about psychology, I’m first legally obligated to mention that I am not a trained psychiatrist.
When talking about decision making under uncertainty, two terms come to mind: heuristics and cognitive biases.
To simplify this talk I’ll refer to both as “biases”; the actual difference between them is not critical here.
Biases are mental patterns: shortcuts our mind takes to simplify the complex task of decision making.
A good analogy for a bias is branch prediction in a CPU. It’s a “calculated shortcut” the CPU makes: it can work, and it does work most of the time, but when it doesn’t, a bad call was made and we need to roll back to the correct state.
Let’s start with an example. A radio commercial might state:
“Boris Insurance Inc. offers a loan at an interest rate 0.5% lower than the bank’s.”
Even that one small sentence has biases in it, put there on purpose to “help” you make the complex loan decision.
In this example it’s the anchoring bias: a cognitive bias where an individual’s decisions are influenced by a particular reference point, or “anchor”.
Our mind is now anchored to the interest rate the bank is offering. Every decision we make will be anchored to this reference point, regardless of whether it’s a good reference point or not.
And as you can probably guess, it’s probably not.
But this was a commercial, What happens in production? What biases are we prone to there?
Before we dive into this
We have to understand that biases come to life at our weakest point: during an incident. There’s no way around it.
What we can do is limit the effect, or volume, of those biases by preparing for them early in our development process,
by:
Designing “bias-proof” systems
Maintaining a “bias-aware” environment with each change we deliver <PAUSE> using effective alerting and monitoring
Creating bias-reducing response procedures
By applying the measures I’m going to discuss, you can benefit from faster incident resolution and lower frustration within the response team, and maybe even get back to your pizza while it’s still warm.
The incidents you’re about to see were managed by trained professionals, do not try them at home (or work).
Each example I’m about to cover is a real-life incident where our responders got blindsided by a bias that affected their behavior.
In the first incident, one of our most critical data sources failed. The backup didn’t work well enough, and our whole business process came to a halt.
I’m sure you have a similar business process in your system too: one composed of steps, each of which can fail.
One of the responders suggested we disable the data source and run without it. <CLICK>
In the “fog of war” it sounded like a solid idea, since it would get the service back to life.
The first time we almost turned it off, 30 minutes into the incident, one of the analysts mentioned it would breach the SLA for one of our top customers.
The second time we almost turned it off, 50 minutes into the incident, it was the head of engineering who mentioned it would put downstream pressure on other systems.
The more this idea floated around the room, the more stakeholders spoke up about the impact it would cause: from accuracy and latency to legal and operational.
This. Was. Just. Frustrating. No one could make the call, while in the background the whole business was impaired.
About two hours into the incident, the idea was escalated to the Chief of Operations, who made the decision: we can’t turn it off, and we waited for the underlying issue to resolve.
To clarify, we had two options.
First: wait for the underlying cause to resolve, and have no service until then.
Second: turn off the failing service, which would cause multiple issues around accuracy, SLA, and more.
Both are bad.
But NOT EQUALLY. We just didn’t know which one was worse, and needed the Chief of Operations to decide.
Our response team was paralyzed; the decision couldn’t be made.
This is analysis paralysis. It manifested in our inability to make the call to turn off the data source.
Analysis paralysis is a psychological effect where the more knowledge we have, the harder it is to make a decision: all the alternatives and outcomes are weighed, without ever coming to a conclusion.
In our incident, the whole business was impaired, so the response team was huge, with additional stakeholders flooding the room.
It took two hours, with more than 10 people involved, to make that decision.
The response team wasn’t independent in its decision making.
So, how can we give the response team its independence back?
It’s not always possible,
but if we start at the requirements and design phase, we can define an order of importance for our system’s SLIs: a pyramid.
When the priority is explicitly defined, the SLIs at the top of the pyramid will be sacrificed to secure the SLIs at its base, and the response team is free to independently make fast decisions to mitigate degradations.
In our case, the pyramid would have stated that the most important SLI is accuracy, so we’d have known we can’t sacrifice accuracy for availability.
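To make that concrete, such a priority order can live as an explicit, shared artifact that the response team can consult without escalating. A minimal sketch in Python, with illustrative SLI names (not our actual configuration):

```python
# Hypothetical sketch: an explicit SLI priority, most important first.
# During an incident, an SLI may only be sacrificed to protect one that
# ranks above it. The SLI names here are illustrative.
SLI_PRIORITY = ["accuracy", "availability", "latency"]

def may_sacrifice(sacrificed: str, to_protect: str) -> bool:
    """True if degrading `sacrificed` in order to protect `to_protect` is allowed."""
    return SLI_PRIORITY.index(sacrificed) > SLI_PRIORITY.index(to_protect)

print(may_sacrifice("latency", "accuracy"))       # True: latency may be given up
print(may_sacrifice("accuracy", "availability"))  # False: accuracy outranks it
```

The point is not the code itself but that the decision rule is written down in advance, so no one has to weigh every stakeholder’s concern at 1 AM.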
This example shows how “bias-proofing” in the design phase can later help mitigate those effects during production incidents.
We’ve touched on what we can do in the design phase; let’s now see how we deliver features and define alerts on them.
This example is about a single alert. This one <point up>. I’ll give you a few seconds to look at it; it’s pseudo Prometheus QL.
Now, let me have your attention again. You’re now as confused as I was.
It’s 10 in the evening, all I know is that there is an “issue with the service’s latency”, and I’m in what appears to be a math lesson!
It’s been 10 years since I last saw standard deviations, and I have no clue what two standard deviations are. By a raise of hands, how many of you know what two standard deviations are? CLICK
I start this incident with Google and Wikipedia, just to understand what the alert means.
How did we get here? How did this alert find its way to production when I have no idea what to do with it?
My surprising midnight math lesson is the result of a bias called the “curse of knowledge”, and it has three main effects:
The first is the tendency to assume that knowledge one possesses is common knowledge. The author of my alert thought it’s basic knowledge that two standard deviations above the mean of a normal distribution is roughly the 98th percentile, so my incident is, in fact, a tail-latency increase.
The second is the inability to roll back to your “unknowing” state. This is why teaching is so hard, and one of the reasons getting new on-call shifters experienced and confident is complex.
The third is that predicting another person’s actions is heavily biased towards one’s own knowledge of the issue. This is why writing runbooks is hard. Runbooks are documented recipes for mitigating an incident; the curse of knowledge is why many runbooks have “missing steps” and implicit knowledge.
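The percentile knowledge the alert’s author took for granted can be checked with a few lines of Python’s standard library:

```python
from statistics import NormalDist

# What "mean + 2 standard deviations" actually means for a normally
# distributed latency metric: the percentile it corresponds to.
percentile = NormalDist().cdf(2) * 100
print(percentile)  # ~97.7, i.e. roughly a p98 tail-latency alert
```

Two standard deviations above the mean sits near the 97.7th percentile of a normal distribution, which is exactly the kind of detail a responder shouldn’t have to derive at midnight.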
How can we mitigate this? We obviously want all our responders to be experts! Knowledge is a good thing!
To mitigate the curse of knowledge, make sure your alerting and monitoring layer is built by the “average” responder; in other words, normalize the expertise level you need to the average you have.
Have your experts review the work and train the team, but avoid having that “one monitoring person” in your systems.
If complex monitoring and alerting can’t be avoided, document them thoroughly, again targeting the “average” responder.
Train, and level up your organization.
Don’t let your responders learn during incidents.
We’ve now touched on how writing alerts and monitors can be affected by biases; in the next example we’ll dive even deeper into monitoring.
The next incident manifested two biases and was one of my most painful production incidents. We’ll talk about those biases one by one.
Let’s first talk about the system at hand. We had started implementing a new generation of services in a microservice architecture, and to do so we needed configurations stored in our monolithic main database.
So we created an ETL process that exposed those configurations in a shared storage for all the relevant services to use.
One of the configurations there <PAUSE> was a highly sensitive map of which features are enabled for each of our customers.
The incident started with elevated error rates in the API layer: we were rejecting API calls, and some customers were being refused key features of our product.
The alerts originated from the API layer, and knowing the process, I started simulating what could be causing this behavior.
And I suspected a performance degradation in the shared storage.
We decided to manually re-run the ETL, and indeed, it solved the issue.
For an hour.
I don’t know if you’ve ever experienced a P1 incident that you thought you’d solved coming back to haunt you, after you’ve already notified the business and management that everything is back to normal.
It’s a really uncomfortable feeling, one that made me doubt my engineering skills.
I gathered some of my teammates and we started the investigation again. Another hour in, my teammates found the root cause: a bug in the replication step of the ETL.
My teammates found it, but not me.
For that whole hour, while they were going through the code and logs, I was digging into the shared storage, proving (mostly to myself) why it was indeed in a degraded state.
I couldn’t hide my surprise!
A BUG, in a component that had been working smoothly for more than a year, with no change in scale or anything else.
It was literally among the last things on my list of possible root causes, somewhere around cosmic radiation.
This incident was a grueling process of checking each component in this flow one by one.
The errors surfaced at the SLI that was actually degraded, but the issue was far upstream.
Why did I dig so deep into that datastore? Why couldn’t I see I was on the wrong path?
I was deeply affected by the simulation bias.
The simulation bias states that one’s judgments are biased towards information that is easily imagined or simulated mentally.
And I was simulating the datastore as the cause.
It’s important to mention that simulation is subjective: what I can simulate, others may not be able to.
This is why I wasn’t able to simulate a bug, and why our data engineers probably wouldn’t have simulated a database performance issue.
Simulation causes high friction with the production system, and in my case, it meant focusing on the wrong elements.
So, what can we do to control what our responders simulate? It sounds like a challenging task.
The problem with the response process was the high surface area between the alert and the system. The alert sat at the end of a long chain of components, and each one needed to be checked to find the issue.
Getting to the root cause took many steps, and the time spent in that process was time spent simulating wrong paths.
Firstly, we need to set the responder on the correct path as soon as possible, and we should aim to do so with minimal friction.
So alerts and monitors should be VERY specific, covering every dependency and key SLI. We’re better off with 20 simple, specific alerts than with one catch-all alert.
If the alert had been on the ETL process, chances are I’d have started by digging into its logs, rather than working my way back from the API through the shared datastore.
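As an illustration (the metric names, thresholds, and annotations here are made up, not our actual rules), narrow Prometheus-style alert rules for that chain could look like:

```yaml
# Hypothetical alerting rules: one narrow alert per component in the chain,
# so the page itself tells the responder where to start.
groups:
  - name: etl-pipeline
    rules:
      - alert: EtlReplicationLagHigh
        expr: etl_replication_lag_seconds > 300
        for: 5m
        annotations:
          summary: "ETL replication is lagging; check the ETL job first"
      - alert: SharedStoreReadErrors
        expr: rate(shared_store_read_errors_total[5m]) > 1
        for: 5m
        annotations:
          summary: "Reads from the shared config store are failing"
```

Each rule names the failing component directly, instead of a single API-level alert that forces the responder to walk the whole chain backwards.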
Secondly, we need to minimize the time to start triage.
Time spent without data, is time spent simulating.
If possible, incident insights should be pushed to the responder, instead of waiting for the responder to pull the information they need.
That means charts and logs related to the incident can be attached to it automatically (most incident management tools support such integrations to some extent).
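A hypothetical sketch of that idea: enrich the alert payload with direct links before it reaches the responder. All URLs and service names below are invented for illustration; real incident-management tools expose similar webhook or enrichment hooks.

```python
# Hypothetical sketch: push triage data to the responder by attaching
# direct links to the alert. URLs and service names are made up.
DASHBOARDS = {
    "etl-replication": "https://grafana.example.com/d/etl-replication",
    "api-gateway": "https://grafana.example.com/d/api-gateway",
}

def enrich_alert(alert: dict) -> dict:
    """Attach dashboard and log-search links for the alerting service."""
    service = alert["service"]
    alert["links"] = {
        "dashboard": DASHBOARDS.get(service, "https://grafana.example.com/"),
        "logs": f"https://logs.example.com/search?service={service}",
    }
    return alert

enriched = enrich_alert({"service": "etl-replication", "summary": "replication lag rising"})
print(enriched["links"]["logs"])  # the responder starts with data, not a search
```

The responder opens the page already holding the relevant chart and log query, rather than spending the first minutes deciding where to look.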
So much for simulation, and for reducing the friction and surface area of the alert.
We’re now ready to talk about the final bias that attacked me during this incident.
I know what you’re thinking: I’m an experienced responder, I should base my decisions on concrete data.
I’m not gonna lie, I did. I had data to support my hypothesis.
So how was I able to “prove” that the datastore was the issue when it wasn’t?
I went straight to the source: the datastore metrics.
And there it was, my smoking gun: CPU usage had increased, and available memory had decreased!
In fact,
my “dangerous” increase in CPU was only 2%.
And the memory? A drop of only 200 MB on a large instance.
The axes in the dashboards were dynamic, but I rushed into action and missed that.
This misinterpreted data was enough to convince me that the database was indeed the issue, and it sent me on a wild goose chase while my teammates were actually narrowing in on the real cause.
Why was it so easy for me, an experienced responder, with vast <PAUSE> daily mileage with observability tools, to misinterpret what I was seeing?
It comes down to the confirmation bias: we seek information that reinforces our existing positions. We come to a conclusion first, then try to find information that fits it,
ignore information that doesn’t, and interpret ambiguous information in our favor.
When you think you know what the incident is, it’s easy to find patterns that reinforce it.
In my case, a 2% increase in CPU confirmed a wrong hypothesis and made me literally useless in this incident.
Now that we know what the bias is, how can we mitigate it?
First, let’s talk about how we show our data.
Keep it simple! Show <PAUSE> simple <PAUSE> data. Complex data is easy to “manipulate” or misinterpret.
That includes sensible legends, colors, and scales: errors should be red, throughput probably green.
In my case, had the CPU usage percentage been shown on a static scale of 0 to 100, there would have been no visible change at all, let alone a spike.
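A small, illustrative calculation (the CPU numbers are hypothetical) shows how differently the same wiggle reads on the two scales:

```python
# Hypothetical numbers: the same 2-point CPU% swing fills the whole chart
# on an auto-scaled (dynamic) axis, but is barely visible on a static
# 0-100 axis.
def visual_fraction(values, axis_min, axis_max):
    """Fraction of the chart height the series' swing occupies."""
    return (max(values) - min(values)) / (axis_max - axis_min)

cpu = [50.0, 50.5, 52.0, 51.0]  # sampled CPU usage, percent
on_dynamic_axis = visual_fraction(cpu, min(cpu), max(cpu))
on_static_axis = visual_fraction(cpu, 0, 100)

print(on_dynamic_axis)  # 1.0  -> looks like a dramatic spike
print(on_static_axis)   # 0.02 -> a flat line to the naked eye
```

Same data, same change; only the axis decides whether it looks like a smoking gun.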
One more thing about simplicity: show data with as few dimensions as possible. Dimensions are complex! The same goes for multiple axes on a single chart, elaborate coloring schemes, and heatmaps.
I’m not saying NOT to use those, but be very aware when you do!
Next: standardize your data!
When the majority of dashboards look alike, looking into any dashboard during an incident will feel familiar to the responder, reducing the chance of misinterpretation.
In my case, I rarely used CloudWatch for metrics, so I wasn’t fully aware that its scales are dynamic.
That’s about data, now a bit about process:
Don’t <PAUSE> work <PAUSE> alone. Incident response is a team effort. Show meaningful data that reinforces your position to your teammates; convince the “unconfirmed” responder that you are correct.
Draw a concrete line between the observed facts, your hypothesis, and the actual production state. Chances are that any teammate of mine who had seen the CPU and memory chart would have smiled and pointed out my mistake.
Those are all the examples I have for you today.
To wrap things up, I’d like to show a short cheat sheet that can help handle some of the biases we’ve talked about:
● Keep anything production-facing simple
● Specific alerts, standardized dashboards
● Normalize production status to the “average” responder
● Prioritise SLIs (the SLI pyramid)
I write about the connection between software and psychology on my blog from time to time, and I repost those pieces on Twitter and LinkedIn, so if you found this talk interesting, be sure to check it out.
That’s all I have for you today. It’s been a pleasure;
thank you for your time!