
SRE (service reliability engineer) on big DevOps platform running on the cloud by Pavlo Serdiuk at Cloud focused 76th DevClub.lv

The talk explains the SRE (service reliability engineer) philosophy and the principles of production engineering and operations in clouds.
(Language – English)

Pavlo is the ADOP (Accenture DevOps Platform) Service Reliability Team Lead and an SRE practitioner. He has more than 18 years of IT experience in Ops and Dev.

Published in: Technology

  1. DEVCLUB.LV 20/06/2019 SRE (service reliability engineer) on big DevOps platform running on the cloud Copyright © 2019 Accenture. All rights reserved.
  2. “RELIABILITY IS THE MOST IMPORTANT FEATURE OF ANY APPLICATION BUT IS OFTEN THE LEAST WELL DEFINED. WHY ARE WE HERE? WE NEED TO USE SRE TO HELP OUR CLIENTS TO CHANGE THAT AND KEEP THEIR FUTURES BRIGHT!”
  3. SITE RELIABILITY ENGINEERING (SRE)
      • Proclaimed by Google as how they do IT Operations
      • Invented by them in 2003
      • First book published in 2016 (30 essays)
      • Read for free online: https://landing.google.com/sre/book/index.html
      • The workbook is much more applied
      Source: https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html
  4. DEFINITION FROM GOOGLE. As per Wikipedia, SRE can be defined as: “a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.” SO IT’S A DISCIPLINE
  5. SRE: WHEN OPERATIONS IS DESIGNED BY SOFTWARE ENGINEERS
      Modern Product Development requires more functionality introduced more frequently, creating more complexity and more support activities. Site reliability engineering (SRE) is part of the solution: a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.
      • Applying principles of computer science and engineering to the design and development of highly available systems
      • Proactively finding ways to make systems more scalable, reliable, and efficient until systems reach “desired reliability targets”
      • Spanning a broad portfolio of software (applications, databases, cloud services) and hardware (network, data-center) assets
      SREs are engineers … focused on reliability … while operating services
  6. “THERE IS NO SUCH THING AS A NEW IDEA. IT IS IMPOSSIBLE. WE SIMPLY TAKE A LOT OF OLD IDEAS AND PUT THEM INTO A SORT OF MENTAL KALEIDOSCOPE. WE GIVE THEM A TURN AND THEY MAKE NEW AND CURIOUS COMBINATIONS. WE KEEP ON TURNING AND MAKING NEW COMBINATIONS INDEFINITELY; BUT THEY ARE THE SAME OLD PIECES OF COLORED GLASS THAT HAVE BEEN IN USE THROUGH ALL THE AGES.” INSIGHT FROM MARK TWAIN
  7. MEASURING UP against other movements. DevOps shouldn’t be like DevOps
  8. SRE versus DEVOPS
      Dev | Ops (wall of confusion)
      DevOps in practice typically leads to:
      • CI/CD pipelines
      • Infra-code at least for test environments
      • DevOps Team
      • Better quality engineering (hopefully)
      SRE in practice will hopefully lead to:
      • Higher reliability after code deploy
      • Better operability
      • Better life for Ops team
      • The right balance of speed vs safety
      But… does DevOps always implement all of DevOps in reality?
  9. class SRE implements DevOps. SRE is focused on a prescriptive way of measuring and achieving reliability through engineering and operations work.
  10. DevOps principle and how SRE practices it
      • DevOps: Reduce organization silos. SRE: Share ownership with developers by using the same tools and techniques across the stack
      • DevOps: Accept failure as normal. SRE: Have a formula for balancing accidents and failures against new releases
      • DevOps: Implement gradual change. SRE: Encourage moving quickly by reducing costs of failure
      • DevOps: Leverage tooling & automation. SRE: Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
      • DevOps: Measure everything. SRE: Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
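
The slide does not spell out the "formula for balancing accidents and failures against new releases". One common reading is an error budget: the share of requests a reliability target allows to fail. The sketch below assumes that reading; the function names, the 99.9% target, and the request counts are all hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests the target allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_release(slo_target: float, total_requests: int,
                failed_requests: int) -> bool:
    """Hypothetical release gate: keep shipping only while budget is left."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0


# Made-up numbers: 99.9% target over 1,000,000 requests in the period.
print(can_release(0.999, 1_000_000, 250))    # a quarter of the budget is spent
print(can_release(0.999, 1_000_000, 1_500))  # the budget is overspent
```

Under this assumption, a team that has overspent its budget would slow releases and work on reliability until the budget recovers.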
  11. SRE IMPACT EXAMPLE: FACEBOOK AUTO-REMEDIATION SYSTEM (FBAR)
      Diagram labels: Remediation Plugins, FBAR API, Facebook Operations API, Monitoring API, Hardware Power Control API, Service Configuration API, Site Operations Repair API
      • Sifts through 3.37 billion notifications from network devices each month
      • Filtering out noise down to roughly 750,000 alarms that need action
      • Of those, FBAR resolves 99.6 percent of the alarms without human intervention
      • Developed and maintained by two full time engineers
      • Doing the work of ~200 full time systems administrators
  12. Defining TOIL
      “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. …one or more of the following…”:
      • Manual
      • Repetitive
      • Automatable
      • Tactical
      • No enduring value
      • O(n) with service growth
      https://landing.google.com/sre/book/chapters/eliminating-toil.html
      All types of SRE work:
      • Software engineering
      • Systems engineering
      • Toil
      • Overhead
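
To make the four types of SRE work measurable, a team could tag logged hours by type and compute each type's share. This is only a sketch: the `WorkEntry` record, the `share_by_type` helper, and the sample week are hypothetical and not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass

# The four work types listed on the slide.
WORK_TYPES = ("software engineering", "systems engineering", "toil", "overhead")


@dataclass(frozen=True)
class WorkEntry:
    description: str
    work_type: str
    hours: float


def share_by_type(entries):
    """Return each work type's fraction of the total logged hours."""
    totals = Counter()
    for entry in entries:
        if entry.work_type not in WORK_TYPES:
            raise ValueError(f"unknown work type: {entry.work_type!r}")
        totals[entry.work_type] += entry.hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {work_type: 0.0 for work_type in WORK_TYPES}
    return {work_type: totals[work_type] / grand_total for work_type in WORK_TYPES}


# A made-up 40-hour week.
week = [
    WorkEntry("build deploy automation", "software engineering", 16),
    WorkEntry("tune load balancer", "systems engineering", 8),
    WorkEntry("restart stuck jobs by hand", "toil", 12),
    WorkEntry("meetings and planning", "overhead", 4),
]
for work_type, share in share_by_type(week).items():
    print(f"{work_type}: {share:.0%}")
```

Tracking the toil share over time shows whether automation work is actually reducing manual effort.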
  13. SRE Principles
      • 50% INNOVATING: Features, Scaling, Automation (software background)
      • 50% OPERATING: Issues, On-Call, Manual Intervention (systems background)
  14. SRE MEASUREMENTS
      • Service Level Agreement (SLA)
          • Defines the service availability for a customer and the penalties for breaking that availability
          • Question: what happens if the SLOs aren’t met? SRE usually not involved
      • Service Level Indicator (SLI)
          • Metrics over time which inform about the health of a service
          • Examples: request latency, error rate, system throughput, availability
      • Service Level Objective (SLO)
          • Agreed-upon bounds for how often SLIs must be met
          • Examples: SLI ≤ target; lower bound ≤ SLI ≤ upper bound
      • SLIs and SLOs are the prescriptive way in which SRE practices the DevOps principle of "measure everything". Implementing SLOs also forces collaboration between product owners and systems operators, adhering to the DevOps principle of "break down organizational barriers".
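
Read as "bounds for how often an SLI must be met", an SLO can be checked by counting the measurement windows in which the SLI stayed within its bounds. The sketch below assumes that interpretation; the helper names, the 300 ms bound, the 90% target, and the sample values are illustrative only.

```python
def slo_compliance(sli_values, lower=float("-inf"), upper=float("inf")):
    """Fraction of SLI measurements satisfying lower <= SLI <= upper."""
    if not sli_values:
        raise ValueError("need at least one SLI measurement")
    good = sum(1 for value in sli_values if lower <= value <= upper)
    return good / len(sli_values)


def slo_met(sli_values, target_fraction, lower=float("-inf"), upper=float("inf")):
    """True when the SLI was within bounds at least target_fraction of the time."""
    return slo_compliance(sli_values, lower, upper) >= target_fraction


# Made-up latency SLI, one value per measurement window, in milliseconds.
latency_ms = [120, 180, 250, 310, 90, 200, 150, 400, 110, 130]
print(slo_compliance(latency_ms, upper=300))
print(slo_met(latency_ms, target_fraction=0.9, upper=300))
```

Leaving `lower` at its default gives the one-sided form (SLI ≤ target); passing both bounds gives the two-sided form.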
  15. MEASURE, MANAGE RISK (DEVELOPMENT, OPERATIONS)
      • SLI: Request Latency; Batch Throughput; Failures per Request
      • SLO: binding targets for a collection of SLIs, for example:
          • 99th percentile latency of requests received in the last 5 mins < 300 ms
          • Ratio of errors to total requests received in the last 5 mins < 1%
      • SLA: agreement between a customer and a service provider, typically based on SLOs, for example whether the total amount of downtime over a year is more or less than the ‘Objective’ of the service
      SLIs drive SLOs which inform SLAs
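
The two example SLOs on this slide can be evaluated directly from request records. The sketch below is one possible way to do it: the `Request` record and helper names are hypothetical, and the nearest-rank percentile method is an assumption, since the slide does not say how the 99th percentile is computed.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 5 * 60        # "the last 5 mins"
LATENCY_SLO_MS = 300.0         # p99 latency must stay below this
ERROR_RATIO_SLO = 0.01         # errors / total requests must stay below this


@dataclass(frozen=True)
class Request:
    timestamp: float   # seconds since the epoch
    latency_ms: float
    is_error: bool


def in_window(requests, now):
    """Keep only requests received within the last five minutes."""
    return [r for r in requests if now - WINDOW_SECONDS < r.timestamp <= now]


def p99_latency_ms(requests):
    """99th percentile latency using the nearest-rank method (an assumption)."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def error_ratio(requests):
    """Share of requests that failed."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.is_error for r in requests) / len(requests)


def evaluate_slos(requests, now):
    """Return which of the two example SLOs currently hold."""
    recent = in_window(requests, now)
    return {
        "latency": p99_latency_ms(recent) < LATENCY_SLO_MS,
        "errors": error_ratio(recent) < ERROR_RATIO_SLO,
    }
```

A caller would pass the collected requests and the current time, for example `evaluate_slos(requests, now=time.time())` after importing `time`, and alert on any entry that is `False`.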
  16. SRE Effectiveness?
      Given their breadth of scope, it becomes important to define performance and success metrics upon which the SRE is evaluated.
      EXAMPLES
      • SLA Compliance
      • System Compliance Profile
      • MTTR
      • Problem or Bug Age
      • Incident to Unique Root-Cause Ratio
      • Toil to Overall Effort Ratio
      • Service Performance (e.g. Page Load Times, Network Latency, etc.)
      • Infrastructure & Cloud Efficiency
      • Service Provisioning Cycle Time
      • Service Automation Ratio
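
Two of these metrics are easy to derive from an incident log. The slide gives no formulas, so the sketch below assumes common definitions: MTTR as the mean time from an incident being opened to being resolved, and the incident to unique root-cause ratio as incidents divided by distinct root causes. The `Incident` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    opened: datetime
    resolved: datetime
    root_cause: str


def mttr(incidents):
    """Mean time from open to resolution (assumed definition of MTTR)."""
    if not incidents:
        raise ValueError("no incidents to measure")
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def incident_to_root_cause_ratio(incidents):
    """Incidents per distinct root cause; values above 1 mean repeat causes."""
    if not incidents:
        raise ValueError("no incidents to measure")
    unique_causes = {i.root_cause for i in incidents}
    return len(incidents) / len(unique_causes)


# Made-up incident log.
start = datetime(2019, 6, 20, 9, 0)
log = [
    Incident(start, start + timedelta(minutes=30), "disk-full"),
    Incident(start, start + timedelta(minutes=60), "disk-full"),
    Incident(start, start + timedelta(minutes=90), "bad-deploy"),
]
print(mttr(log))
print(incident_to_root_cause_ratio(log))
```

A ratio well above 1 suggests the same root causes keep recurring, which points to fixes that have not yet been made permanent.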
  17. SRE Getting Started: CONSIDER IMPLEMENTATION OF SRE AS A CULTURAL JOURNEY
      Think big… …start small… …scale fast.
      Set the Strategy, Begin Implementation, Transform the Organization
      Talent, Organization and Culture
      • Align on the portfolio of Product Development services available and identify health indicators
      • Web app, MW services, SAP HR, etc.
      • Identify product owner & end-customer
      • What reliability expectations do they have? (availability, latency, etc.)
      • What indicators and mechanisms do you use to measure health today?
      • Identify value potential & path forward
      • Dependency on other services
      • Are reliability expectations realistic?
      • Does the team have the right telemetry in place to measure E2E health?
      • Is the team empowered and skilled to make changes to improve reliability?
      • Select Product Development services that have the greatest need for reliability to the business: availability, stability, and performance
      • Consider agility and viability constraints
      • Service criticality (Maintenance state/EOL, critical to core, strategic)
      • Telemetry state, productivity tools
      • Organizational considerations: consolidating multi-tiered groups into a single multi-modal group
      • Human capital strategy
      • Dependency on other services
      • Assemble pilot SRE team for identified Product Development service, define operating model, productivity measures, start running
      • Reflect on pilot team achievements across health and reliability metrics: MTTR, Availability, Performance, Incident ratios
      • Trend data over 30, 60, 90 days
      • Analyze SRE backlog: big rock projects
      • Talk about toil (work humans don’t wish to do)
      • Fine-tune, continue to improve
      • Initiate broader assessment and selection of Product Development services based on business need and viability (functional and technical)
      • Strategize SRE specialties based on nature of Product Development service portfolio to scale while remaining lean
      • SRE for custom web applications
      • SRE for storage infrastructure
      • SRE for all packaged back-office solutions
  18. ADOP STATE OF THE UNION
  19. ADOP: ACCENTURE DEVOPS PLATFORM. What ADOP Can Do for You
      • You can mobilize your ADOP toolset in less than 48 hours with 3 easy steps through our self-service portal
      • DevOps processes on the ADOP integrated tooling environment have been known to reduce delivery costs substantially
      • The platform supports projects of all sizes, both enterprise-scale and smaller projects, with a low-cost and flexible subscription model
      • ADOP includes ready-to-go pipelines and infrastructure automation branded cartridges for hundreds of technologies
      • ADOP supports both Agile and Waterfall projects by driving increased productivity, quality, and lower risk
  20. WHAT CAN YOU DO WITH ADOP?
      The platform is designed around technology extensions and re-usable components called cartridges, which further accelerate DevOps enablement.
      • Document and Manage Project Scope
      • Track Project Progress
      • Build Code Artifacts and Products
      • Deploy your Code to Any Environment
      • Test your System
      • Enforce Security Policy
      • “Install” Accenture Best Practices
  21. ADOP/Enterprise History (2014, 2015, 2016, 2017-2018)
      • Launched Managed Jira within ALM Factory in Hoff Data Centre
      • Re-platforming to AWS cloud
      • ADLM merges with ADOP
      • ADOP CI/CD Offering
      • Accenture DevOps Platform
      BY THE NUMBERS…
      • Projects using CI/CD: 215+ on 300+ Masters
      • 560+ Clients supported by ADOP
      • SaaS: Confluence 21.5K in last 3 months; Jira 45K+ Total Users, 11K+ Active in last 3 months
      • 27M+ LOC Analysis total
      • 17K+ Jenkins Job weekly
      • Cloud: 4 Clouds Account, 330 EC2, 1000 Containers, 300TB data, 500+ Security groups
      • Accenture Security Compliant
      • 600+ Tickets processed monthly
      • PaaS capabilities
      • Self Service Capabilities
