Site Reliability Engineering (SRE) is a new way of running large-scale software systems. Devised and popularised by Google, SRE is a set of disciplines and team dynamics that work alongside modern software engineering practices to help produce reliable software at scale. It combines deep knowledge of technical infrastructure, operating systems and computer networking with attention to service level objectives (SLOs), which keeps the work focused on business-relevant activities.
SRE requires new ways of organising work, new ways of hiring, and new modes of interaction between teams. We explore what these approaches are and how they affect IT organisations.