The document discusses how to learn from failures through effective retrospective meetings. It recommends that retrospectives include proper preparation like choosing a facilitator and building a timeline. During the meeting, the most involved engineer should provide context, customer impact should be discussed, and the discussion should focus on process improvements rather than blame. Many potential improvements or "remediations" may be identified. Both engineering and product teams should consider improvements to prevent future issues and improve customer experience. Effective retrospectives can help organizations continuously learn and improve.
48. Is your fix a small thing you can add to existing customer tools?
Engineering should be able to do this with minimal product sign-off.
49. You can improve your customers’ experience.
Your customers, your fellow engineers, and your community can benefit from your own needs and hard-won experience.
Hi, I’m Joy and I’m the SRE director at Heroku.
For those of you who aren’t familiar with Heroku, we’re a Platform as a Service. This means we handle a lot of the operations work for the customers who run on our platform. My job is to keep our platforms maximally stable so our customers can sleep easy at night.
I'm here to talk about failure and why I love it, or at least don’t hate it.
Why would I want to talk about failure? Failure is amazing: it can be our best teacher. As an SRE, failure is utterly crucial to me doing my job. Complex systems often fail, and we learn so much more from their failure than from their success.
A lot of us have probably had this realization. If we didn’t have failure, we’d be out of a job.
So the question today is how do we learn from that failure? How do we learn from that failure in a way that doesn’t make us feel like failures?
Let's start with an SRE war story — everyone loves a good war story.
How many of us have ever run out of integers in an auto-incrementing primary key column in a database?
The whole database halts because it just ran out of numbers. And it’s usually a critical database.
I've seen this failure mode pretty much everywhere I've ever worked as an SRE.
It's pretty embarrassing because seriously -- you just ran out of numbers. It seems really easy to fix but it just keeps cropping up.
So what are some of the reasons that this keeps happening?
Commonly used frameworks have defaults that can come back and bite you later.
Assumptions about the size of your database before it hits production. It’s a good problem to have when you’re successful enough that you outgrew your original assumptions. Two billion — that's a lot of numbers.
Or just not thinking about it at all! That's probably the most common reason.
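To put a number on "two billion": a signed 32-bit integer column tops out at 2,147,483,647. The sketch below is only an illustration of the arithmetic, not anything from the talk; the function name and the insert rates in the usage note are made up.

```python
# Upper bounds of signed integer columns (standard two's-complement ranges).
INT32_MAX = 2**31 - 1  # 2,147,483,647: the "two billion" ceiling
INT64_MAX = 2**63 - 1  # ceiling of a 64-bit BIGINT column


def days_until_exhaustion(current_max_id: int, inserts_per_day: int,
                          limit: int = INT32_MAX) -> float:
    """Estimate how many days remain before ids would pass `limit`.

    Hypothetical helper: it assumes a steady insert rate, which real
    workloads rarely have, so treat the result as a rough estimate.
    """
    if inserts_per_day <= 0:
        raise ValueError("inserts_per_day must be positive")
    remaining = max(limit - current_max_id, 0)
    return remaining / inserts_per_day
```

For example, `days_until_exhaustion(1_500_000_000, 5_000_000)` estimates how long a table that is already at 1.5 billion ids would last at five million inserts a day. Ids can also be used up faster than rows accumulate, for instance when inserts are rolled back, so the real margin may be smaller than the row count suggests.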
So we had this happen to us twice in two months. That was pretty bad.
Then it happened a third time almost a year later. For me, as the head of SRE, seeing this again was pretty painful!
We run a Platform as a Service! Our whole premise is doing operations for our customers so they don’t have to. So how do we fix this problem for real?
First we have to consider it more deeply than we did at the start. If the obvious fix was the long-term fix it wouldn't keep coming up.
It’s simple enough to fix one occurrence of this: just change the column to BIGINT. Data starts flowing again, and folks go back to business as usual.
When this happened the second time we applied a similar fix, and we also poked around manually at other crucial DBs that might have this problem. We even caught a few before failure that way.
We needed to fix this a lot more systemically. Fortunately there’s a good tool for that!
So who here is familiar with retrospectives? I imagine most people here have been to or at least know about them as a place to reflect on past projects or incidents.
One of the main things that SRE instituted at Heroku was retrospectives for all customer-affecting incidents.
If you have been to a retrospective, you probably have been to a boring retrospective. I know I’ve run boring retrospectives. Sorry.
I used to think that if you just got the right people in a room together to chat over an incident, things would naturally happen and we’d have a great, engaging conversation and leave with an amazing solution that would fix our problems. Maybe it would also solve world hunger.
In reality, when you pop a one-hour meeting about an outage on a bunch of folks’ calendars with no context, this stuff happens:
Some folks don’t show, because they are allergic to calendars, email, and meetings.
The ones that do show might be there because they have an axe to grind, or because they feel like they have to defend themselves.
Establishing the timeline in the meeting leads to bickering and “well, actually” statements that put everyone who wasn’t in a bad mood into a bad mood.
Once everyone is sufficiently miserable, you’re most of the way through your time. You have about 5 minutes to give people some work to do as the cherry on top of the misery sundae.
If that doesn’t happen, everyone is bored and tuned out. The engineers are all doing email. The facilitator is doing email. No one’s paying any attention. At the end of the meeting you have some cursory remediation items and if you are lucky some might actually get done.
For a retrospective to be useful, it can’t be boring. A retrospective is the pivot point between failure and learning. If it’s boring, no one is learning and you might as well give everyone back the time in their day they were sitting in the meeting.
Putting a bunch of highly-paid engineers in a meeting for an hour in which they don’t learn anything is a waste of time, money, and morale.
One problem we had with the first INT rollover is that we didn’t have a retrospective, because folks thought it would be a waste of time for something so trivial and easily understood. They were trying to avoid a boring, time-consuming meeting without a clear sense of what value it would have.
This makes sense. I avoid boring meetings too. In this case, though, the problem was deceptively simple. Had we dug into it the first or even the second time, we would have discovered that.
So how do you have non-boring, useful retrospectives?
One way to create engagement during the retrospective is by preparing for the meeting. Don’t force people to watch the sausage being made.
It is excruciating for someone to attend a meeting and then have to figure out the timeline, or to find that the right people aren’t in the room, or even that the wrong people are there instead.
Retrospectives are a big time commitment we expect people to make, and we need to make them count. People should know that when they show up to a retrospective, they're actually going to get something good out of it.
The facilitator is the most crucial role in this meeting.
This person should familiarize themselves with the facts of the incident. Ideally they are someone who is adjacent to the incident but not a primary responder, because they're going to be talking a lot in the meeting, and they shouldn’t be asking questions of themselves. The facilitator should know who was involved in the incident and why.
You should also build a timeline. This can be done by the facilitator while they're gathering all the facts for the retrospective. This is really important.
When I say build a timeline, I don't mean have everything down to the second of precision and every little tiny detail. It should be an overview.
Think of it as a narrative - how would you tell the story of this event? If you were telling a story, you would have a beginning, middle, and end. You’d cover salient points. And you probably wouldn’t be going for microsecond precision.
Any good engineer needs their tools. When I talk about tools, I don’t just mean stuff that you can check into a repo. I mean mental tools as well.
Here’s an overview of the tools I most commonly use to create engaging retrospectives. There’s nothing magical about any of these -- you can use them too.
I’ll take you through them.
Why chat? Audio transcriptions are error-prone and time-consuming.
We run all our incidents, and indeed our day to day communications, in chat. That means everything has a transcript that you can refer back to. People can communicate in parallel -- you don't have to worry about interrupting someone on the voice bridge, and you don’t need someone to transcribe what’s happening on a voice bridge. You can copy and paste commands as needed.
I don’t care which type of chat you use, as long as you use chat.
Bot tools include incident management tools built on top of our chat bots. One example is here, where we recorded something for the timeline of this incident.
We deploy in chat, and deploys emit chat notifications. Pages alert in chat. We also have incident-management specific tools we wrote that can create notes for building a timeline or questions to follow up on while the incident is ongoing.
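As a rough illustration of the note-taking idea, here is a minimal Python sketch of what a chat bot might keep behind a "record this for the timeline" command. Heroku's actual bot tooling isn't shown in this talk, so the class, its method names, and the output format are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class IncidentLog:
    """Collects timestamped notes and follow-up questions during an incident."""

    incident_id: str
    entries: List[Tuple[datetime, str, str, str]] = field(default_factory=list)

    def record(self, kind: str, author: str, text: str,
               when: Optional[datetime] = None) -> None:
        """Store a 'note' or 'question'; defaults to the current UTC time."""
        if kind not in ("note", "question"):
            raise ValueError("kind must be 'note' or 'question'")
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, kind, author, text))

    def timeline(self) -> str:
        """Render entries in chronological order as a plain-text overview."""
        lines = [f"Timeline for incident {self.incident_id}"]
        for when, kind, author, text in sorted(self.entries, key=lambda e: e[0]):
            marker = "?" if kind == "question" else "-"
            lines.append(f"{when:%H:%M} {marker} {author}: {text}")
        return "\n".join(lines)
```

A facilitator could paste the output of `timeline()` straight into the retrospective agenda and use the entries marked `?` as starting points for discussion.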
This makes gathering information for the retrospective much easier. It’s also great for transparency and discoverability amongst our engineers.
SitReps (or situation reports) are a common pattern in incident response anywhere.
You just want a periodic summary of the situation. This isn't what you're telling customers -- this is what you're telling people internally. You can of course use jargon and acronyms, and you don't need to polish it to the same level as you would customer-facing communications.
The goal is to make sure that responders have check points to guide themselves with as they work on the incident, especially as new folks come in.
These are also very helpful when you try to understand what happened after an incident -- sitreps give you milestones of what happened and when.
People underestimate the amount of time it takes to run a good retrospective. I'm not just talking about the time that it takes in the meeting. Prior preparation generally shortens the amount of time you all have to spend in a room together.
Block out time for yourself to prepare at least one day before the retro is scheduled.
Make sure all key players (including the incident coordinator and the communications people) are available and plan on attending the meeting. If someone crucial can’t attend, either reschedule or have someone who can speak for them show up instead (such as a team member).
Make sure you have a note-taker, someone who isn’t a primary responder so they won’t have to talk and take notes simultaneously.
In general, be organized. Send out the agenda, including the timeline, the day before. Make sure the room is booked ahead of time and A/V is working.
When everyone shows up with context, retrospectives can get to the interesting bits faster. Who doesn’t love dissecting a failure in a complex system?
I love doing this and I know a lot of us do, because that’s why we’re in SRE.
So everyone is in the retrospective and the timeline is done. How do we start?
We set context, we keep it short, and we don't do the litany of timeline reading. Think of telling a story.
Have the most involved engineer give a brief summary of what happened. They should stick to the facts and take less than five minutes. The goal is to make sure that everyone orients themselves to what happened.
One thing I should say is that a retrospective should happen within a week of the incident. People should still have the incident relatively fresh in their minds by the time you go to retrospect. Otherwise you're wasting people's time, and you've missed a chance to strike while the iron is hot and folks are feeling motivated to tackle remediations.
Once you are actually in the meeting you're going to want to read the room.
As a facilitator you need to make sure that everyone is engaged. You yourself need to be a very present and active part of leading the discussion. Don't be the note taker -- make sure someone else is the note taker.
You'll need to ask questions of everyone, especially the quiet folks. Some people will want to dominate the conversation and some will never want to jump in, but that quiet person probably has some really good insights.
You should talk about customer impact!
You should have compassion for what your customers felt during the outage. It's not just that you woke up at 3 AM because your database ran out of numbers. Your customer, who might be running a business on your platform from the other side of the world, could have lost some valuable business or some important work, and you need to be aware of that disruption.
Take note of interesting questions, statements, and points of confusion. This gives you jumping off points for deeper conversations.
When we’ve established context we can start diving into these things.
Once you have some starting points for your questioning, dive in. There are various methods you can use to formulate questions for investigation.
A lot of people like the 5 whys -- I think that it’s interesting (it was created at Toyota) and very logical for engineers to grasp, but I like more flexible methods. I really like John Allspaw’s Infinite Hows. Asking “why” can frame the conversation in a more blameful way than asking “how”.
I don’t think this needs to be prescriptive, though. Simply don’t stop asking questions until you have gotten many layers deep.
Really really important -- if you ever get to human error, keep digging. Your systems are created and operated by humans for humans. Human error is a constant.
I cannot emphasize this enough! You have to work around and with human error.
Have you ever heard the phrase “Linux is user-friendly, it's just picky about its friends”? I disagree. Linux is dangerous. Complex and powerful tools can be dangerous. If you can take out your system with a typo, your systems are too fragile, because someone is going to make a typo.
If someone skips a step or makes a typo due to exhaustion or inattention, that’s not on the engineer.
Always assume good intent. Humans get tired, humans get burnt out, humans get distracted. And humans run your systems.
When we build and maintain complex systems we have to develop interfaces for them that are as tolerant as possible to human frailty. The bonus here is that we like working with systems like this. Less friction and stress over using your tools means happier engineers, and happy engineers mean better work.
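One common pattern for typo-tolerant interfaces is to make destructive commands require the operator to retype the exact name of the thing they are about to destroy. The Python sketch below illustrates that pattern in general. It is not a tool described in this talk, and the function names are hypothetical.

```python
def confirm_destructive_action(target: str, typed_confirmation: str) -> bool:
    """Return True only if the operator retyped the exact target name."""
    return typed_confirmation.strip() == target


def drop_database(name: str, typed_confirmation: str) -> str:
    """Refuse to proceed unless the confirmation matches the database name."""
    if not confirm_destructive_action(name, typed_confirmation):
        return f"Refusing to drop {name!r}: confirmation did not match."
    # A real tool would perform the destructive operation here.
    return f"Dropped {name!r}."


if __name__ == "__main__":
    # A slip of the finger ("staging-db" vs. "staging-bd") is caught, not executed.
    print(drop_database("staging-db", "staging-bd"))
    print(drop_database("staging-db", "staging-db"))
```

The design choice is that the safe outcome is the default: a tired engineer who mistypes gets a refusal instead of an outage.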
Usable, beautiful tools are an investment in scaling and reliability.
A reason to be very careful about respecting human failings is that we don't want to make people feel defensive.
When someone feels that they have to defend themselves, they throw up shields.
After that point, you won’t get useful information out of that retrospective. Folks need to feel safe to disclose mistakes they have made. That's how we find out how to fix these gaps in our tools.
One way you can tell a retrospective was good is that at the end you have a ridiculous list of remediation items.
Remediations can range from big and sweeping, to tiny and tactical, to completely absurd.
The ridiculous ones mean you made it to the end of the line of questioning!
Don’t feel you have to do every remediation that comes out of a retrospective. Give yourself the freedom to think about all the options and narrow them down afterwards. Narrow down what you can commit to only after you’ve been creative.
Don’t discount big projects either! That’s the really interesting work.
This is where it helps to understand your company’s process for bringing new work into engineering.
All too often we focus on remediations we can do quickly and within one team. We should be thinking more holistically.
Product is often really excited to hear new ideas. It’s their job to think about how to improve customer experience and what new things customers want.
SREs are great at finding problems and Product is great at finding solutions.
An example of something that came out of a common need for our engineers and our customers -- Heroku Pipelines. We use this for our own internal deployment flows! A lot of Heroku runs on Heroku.
Apps in a pipeline are grouped into “review”, “development”, “staging”, and “production” stages representing different deployment steps in a continuous delivery workflow.
You don’t have to build something huge for it to be customer facing. A lot of the time, SREs think of ourselves as internet plumbers (or janitors) -- no one knows we’re there until something’s broken. That’s valuable!
It’s also gratifying to see your work in front of a customer.
Don’t limit yourself to behind the scenes work. Don’t settle for tools that are unpleasant to use. Don’t prevent yourself from bringing up ideas because it will require cross-team or cross-functional collaboration.
You can improve your customers’ experience and your own.
Back to our war story. What did we actually do to fix our INT rollover problems?
Well, we added tooling to easily detect rollover conditions and give you a heads up to fix them before your database comes to a halt.
There’s a Heroku Postgres tool called pg:diagnose, and it will now alert you when 75% and then 90% of your integer sequence is consumed.
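The two-threshold idea is simple enough to sketch in a few lines of Python. This is only an illustration of the 75% and 90% thresholds mentioned above, not the actual pg:diagnose implementation. The function name and the "ok" / "warning" / "critical" labels are assumptions.

```python
# Largest value a signed 32-bit INTEGER column can hold.
INT32_MAX = 2**31 - 1


def sequence_alert_level(last_value: int, max_value: int = INT32_MAX) -> str:
    """Classify how much of an integer sequence has been consumed.

    Thresholds follow the 75% and 90% levels described in the talk;
    the returned labels are hypothetical.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    used = last_value / max_value
    if used >= 0.90:
        return "critical"
    if used >= 0.75:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for value in (1_000_000_000, 1_700_000_000, 2_000_000_000):
        print(f"{value:>13,}: {sequence_alert_level(value)}")
```

Two thresholds give you an early heads-up with time to plan the migration, followed by a louder signal if nobody acted on the first one.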
We also added process. There’s a productionization checklist that services should go through before they hit production. We added an item to ensure sequences use BIGINT. There’s no reason for us to use integer rather than bigint columns for sequences in Heroku Postgres.
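A checklist item like this can also be backed by a small automated check. The Python sketch below assumes you can already export a list of column definitions from your schema. The `Column` record, the set of type names, and the function name are all hypothetical, and how you obtain the column list depends on your own tooling.

```python
from typing import Iterable, List, NamedTuple


class Column(NamedTuple):
    table: str
    name: str
    data_type: str        # e.g. "integer" or "bigint", as reported by the database
    auto_increment: bool  # True if the column is fed by a sequence


# Type names treated as too narrow for a sequence-backed key (an assumption).
NARROW_TYPES = {"smallint", "int2", "integer", "int", "int4", "smallserial", "serial"}


def narrow_sequence_columns(columns: Iterable[Column]) -> List[str]:
    """Return 'table.column' for every auto-incrementing column narrower than BIGINT."""
    return [
        f"{c.table}.{c.name}"
        for c in columns
        if c.auto_increment and c.data_type.lower() in NARROW_TYPES
    ]


if __name__ == "__main__":
    schema = [
        Column("events", "id", "integer", True),
        Column("events", "payload_size", "integer", False),
        Column("users", "id", "bigint", True),
    ]
    print(narrow_sequence_columns(schema))
```

Only sequence-backed columns are flagged, so ordinary integer columns such as counters or sizes don't generate noise.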
And of course we could and will improve.
We’d like to have this check scan our production databases automatically and alert before failure. Then of course, we could give that option to our customers.
We also are sending pull requests to at least one common open source framework (yes, still looking at you, ActiveRecord) to set better defaults.
Thanks for sticking with me while I explain why I love failure.
We’re all going to fail at some point, and operating distributed systems means your odds get much higher. It’s way easier to fail when you remember that every failure is a chance to learn.
Make them count!
Some relevant links! I hope these help you.
Thank you for your time today.