Robin van Zijll & Janna Brummel 24 May 2018
How We Try to Make a Lion Bulletproof
Setting Up SRE in a Global Financial Organization
Introductions
Janna Brummel
IT Chapter Lead SRE
Robin van Zijll
Product Owner SRE
ING is a global financial services provider serving more than 35 million customers. In the
Netherlands we are the banking sector market leader, with over 8 million retail customers
Customers
35 million
private, corporate and
institutional customers
Countries
more than 40
In Europe, Asia, Australia,
North and South America
Employees
52,000 worldwide
12,416 in NL
Market leaders Benelux
Growth markets
Commercial Banking
Challengers
Mobile Banking is used by 3.5 million customers, who generate 4.4 million logins per day.
Internet Banking is used by 6.1 million customers, who jointly log in 1.4 million times a day.
17,400 machines are spread over 2 data centers and use 14 PB of storage.
[Chart: Availability Report of 2017, y-axis from 99.40% to 100.00%.
Internet Banking Retail: 99.72% availability, 0.19% change, 0.09% incident.
Mobile Banking Retail: 99.65% availability, 0.22% change, 0.14% incident.]
Why do we need to improve the reliability of our services?
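To put the 2017 availability figures in perspective, a rough sketch can convert them into hours of unavailability per year. It assumes that 99.72 belongs to Internet Banking Retail and 99.65 to Mobile Banking Retail (read from the chart labels), and that the percentages cover a full 24/7 calendar year of 8,760 hours. The slide states neither, so treat the results as an approximation.

```python
# Assumption: availability is measured around the clock over a non-leap year.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of unavailability implied by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * period_hours


# Values taken from the chart; the service-to-value mapping is an assumption.
for service, pct in [("Internet Banking Retail", 99.72), ("Mobile Banking Retail", 99.65)]:
    print(f"{service}: {pct}% -> about {downtime_hours(pct):.1f} hours per year")
```

Under these assumptions the two services were unavailable for roughly 24.5 and 30.7 hours over the year.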
Site Reliability Engineering, as pioneered by Google, means doing
work historically done by operations teams, but with
engineers whose aim is to automate the toil within their
organization.
By design, it is crucial that SRE teams are focused on
engineering. There is a 50% cap on operational work (tickets,
on-call, manual tasks) and at least 50% of SRE time should
be spent on engineering.
Site Reliability Engineering (SRE) is what happens when you ask a software
engineer to design an operations team
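The 50% cap can be checked with very little code once time is logged per category. The sketch below is illustrative only: the `TimeEntry` type, the category names and the `exceeds_cap` helper are hypothetical, loosely based on the operational work listed on the slide (tickets, on-call, manual tasks).

```python
from dataclasses import dataclass

# Hypothetical category names for the operational work the slide lists.
OPERATIONAL = {"tickets", "on-call", "manual"}


@dataclass
class TimeEntry:
    category: str  # e.g. "tickets", "on-call", "manual", "engineering"
    hours: float


def operational_share(entries: list) -> float:
    """Fraction of logged hours spent on operational work (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    ops = sum(e.hours for e in entries if e.category in OPERATIONAL)
    return ops / total


def exceeds_cap(entries: list, cap: float = 0.5) -> bool:
    """True when operational work takes more than the cap (50% by default)."""
    return operational_share(entries) > cap
```

A team that logs 10 hours of tickets, 6 hours of on-call and 24 hours of engineering has an operational share of 40% and stays under the cap.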
Within ING we have a number of challenges related to our reliability that we
want to solve through SRE
Teams are not in control of their monitoring solutions and
cannot fix them when they break.
It takes too long for an alert to reach the right team: on
average it takes 69 minutes before an engineer starts
working on resolving an incident.
We do not learn enough from mistakes made – we have
yet to become a learning organization.
We prove we are in control with documents, not by
checking the actual state of our code in production.
Teams are not always aware of their services’
performance and cannot take full responsibility for run.
Our centralized monitoring solutions sometimes
encounter scalability and availability issues.
Our centralized alerting solution is unreliable and
does not send alerts directly to BizDevOps teams.
The same incidents occur multiple times and
we do not follow up on incidents enough.
Our engineers spend more time on completing
documents than coding.
Teams do not always measure availability
from a white box monitoring perspective.
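A figure such as the 69-minute average above could be derived from two timestamps per incident: when the alert fired and when an engineer started working on it. The sketch below shows that calculation. The function names and the input shape are assumptions for illustration, and the slides do not say how ING measures this.

```python
from datetime import datetime
from statistics import mean


def minutes_to_engage(alerted_at: datetime, engaged_at: datetime) -> float:
    """Minutes between an alert firing and an engineer starting work on it."""
    delta = engaged_at - alerted_at
    if delta.total_seconds() < 0:
        raise ValueError("engaged_at must not be earlier than alerted_at")
    return delta.total_seconds() / 60


def mean_minutes_to_engage(incidents) -> float:
    """Average engagement delay over (alerted_at, engaged_at) pairs."""
    return mean(minutes_to_engage(a, e) for a, e in incidents)


# Made-up sample incidents, not ING data.
sample = [
    (datetime(2018, 1, 1, 10, 0), datetime(2018, 1, 1, 10, 30)),
    (datetime(2018, 1, 2, 9, 0), datetime(2018, 1, 2, 10, 30)),
]
print(mean_minutes_to_engage(sample))
```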
We have adopted the Spotify model and work in Tribes composed of
BizDevOps squads: our SRE team is positioned centrally within NL as a silo
SRE
enable & support
CL
PO
Product
Development
Capacity Planning
Testing + Release Procedures
Postmortem/Root Cause Analysis
Incident Response
Monitoring
Our SRE team enables engineering teams through delivery of tooling,
facilitation, consulting and education
We facilitate BizDevOps squads during post mortems
and consult whenever our help is needed in fixing or
identifying reliability issues.
We build tooling to enable BizDevOps squads. At the moment
we focus on Prometheus (alerting, white box monitoring and
traffic modeling) and Mattermost (ChatOps).
We educate others about SRE during demos and
we develop training materials.
We facilitate the creation of more SRE teams and
ask them to join our SRE community meetings
with the other NL-based SRE teams.
We are not on call: BizDevOps teams are
responsible for their own build and run.
We aim to reduce our time to repair through engineering by improving our
monitoring with Prometheus and introducing ChatOps with Mattermost
pull metrics
queries
push alerts
Prometheus &
Alertmanager
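The "pull metrics" arrow means that a service exposes its own counters over HTTP and Prometheus scrapes them. The sketch below shows the idea with the Python standard library only, rendering counters in the Prometheus text exposition format on a `/metrics` path. The metric name `login_requests_total`, the `outcome` label and the port are made up for illustration. In practice the official Prometheus client libraries handle this rendering.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters for an imaginary login service.
REQUESTS = {"success": 0, "error": 0}


def record_request(outcome: str) -> None:
    """Count one handled request; outcome is "success" or "error"."""
    REQUESTS[outcome] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP login_requests_total Login requests handled, by outcome.",
        "# TYPE login_requests_total counter",
    ]
    for outcome, count in sorted(REQUESTS.items()):
        lines.append(f'login_requests_total{{outcome="{outcome}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes this path; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocking call: serve /metrics until interrupted."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A scrape job pointed at it would then collect the counters, and an error ratio computed from them gives a white box view of availability.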
And now for the E in SRE: Introducing the Reliability Toolkit
[Diagram: Reliability Toolkit. Labels: Tools, Metrics, NLA, Client libraries in engineering frameworks, CollectD, Alert Manager, Model Builder, SMS, E-mail, ChatOps]
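The path from an alert to ChatOps can be pictured as a small translation step: take the JSON that Alertmanager sends to a webhook receiver and turn it into a chat message. The sketch below is not ING's toolkit, whose internals the slides do not show. It assumes the Alertmanager webhook payload shape (`status`, `alerts[].labels`, `alerts[].annotations`), the conventional but optional `severity` label and `summary` annotation, and a Mattermost incoming webhook that accepts a JSON body with a `text` field.

```python
import json


def format_alert_message(payload: dict) -> str:
    """Turn an Alertmanager-style webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", payload.get("status", "unknown")).upper()
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unnamed alert")
        severity = labels.get("severity", "unknown")
        summary = annotations.get("summary", "no summary provided")
        lines.append(f"[{status}] {name} (severity: {severity}): {summary}")
    return "\n".join(lines)


def mattermost_body(payload: dict) -> bytes:
    """Build the JSON body to POST to an incoming webhook URL."""
    return json.dumps({"text": format_alert_message(payload)}).encode("utf-8")


# Hypothetical alert, for illustration only.
example = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighLoginErrorRate", "severity": "critical"},
            "annotations": {"summary": "Login errors above threshold"},
        }
    ],
}
print(format_alert_message(example))
```

Posting the result is left out on purpose: the webhook URL is deployment-specific, and the same formatted text could just as well go to an SMS or e-mail gateway.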
Our learnings after two years of SRE at ING
People
Process
Technology
▪ Never compromise on mindset in hiring SREs.
▪ Assign a PO to protect team focus on engineering and to spread the SRE love.
▪ Consider what mix works well for you in terms of new and existing hires, or think about
possibilities of SRE internships.
▪ Test if SRE works for you by doing a pilot phase.
▪ Have a vision on your definition of SRE as a team, define a roadmap together.
▪ Learn from others through online resources, at conferences or company visits.
▪ Prepare to spend time explaining and promoting SRE and your tooling.
▪ Beer o’clock is great for team bonding.
▪ Make it attractive for others to use your tooling: take away pain from teams,
incorporate your tooling in widely used frameworks, find quick wins.
▪ Productization takes time, a lot of time. Don’t underestimate this.
▪ Consider scalability and ownership in your tooling strategy.
Questions?
