SlideShare a Scribd company logo
1 of 31
Download to read offline
Confidential and Proprietary Information for Instana, Inc.
SRE:
A day in the life …
by Marcel Birkner
Confidential and Proprietary Information for Instana, Inc.
Bio
Marcel Birkner works as a Staff Reliability
Engineer at Instana, an Application
Performance Monitoring (APM) solution. He
has long experience in software
engineering and software automation.
Currently he focuses on migrating the
existing stack to Kubernetes and reducing
overall system complexity.
Confidential and Proprietary Information for Instana, Inc.
Abstract
What does a typical day as an SRE look like? In this presentation I will discuss
the challenges we face while running a SaaS platform that is used 24 / 7 / 365
around the globe. In doing so, we have embraced the core principles
described in the Google SRE handbook. While we are not Google by any
means, most of the principles apply to our daily work one way or another.
Having a fully distributed team running a distributed system can be quite
challenging. In this talk I’ll be covering:
● Core SRE principles
● How Instana has applied them to our daily work
● Lessons learned along the way
Confidential and Proprietary Information for Instana, Inc.
Who We Are
Confidential and Proprietary Information for Instana, Inc.Confidential and Proprietary Information for Instana, Inc.
SRE Team
1 Team
3 Time zones
24 / 7 / 365 support
On-call rotation
Members have
operations and
software
engineering
background
Team US
Team EU
Team AU
Confidential and Proprietary Information for Instana, Inc.
What We Do
Confidential and Proprietary Information for Instana, Inc.
Application
Tracing EUM
Mobile App
ServiceInfrastructure
https://play-with.instana.io
Confidential and Proprietary Information for Instana, Inc.Confidential and Proprietary Information for Instana, Inc.
Stats
280 TB / Month
Ingress
8 PB / Month
Cross AZ Traffic
30K+ ECU
8 different datastore
clusters / region
4K+ Containers
Running in SaaS
Confidential and Proprietary Information for Instana, Inc.Confidential and Proprietary Information for Instana, Inc.
SaaS Regions
Multi Cloud Strategy
2 x AWS regions
2 x GCP regions
HashiCorp
Nomad/Consul
Kubernetes
Confidential and Proprietary Information for Instana, Inc.
How We Do It
Confidential and Proprietary Information for Instana, Inc.
SRE by the book
OnCall
Post Mortems
SLI / SLO / SLA
RCA
Error Budgets
Database Ops
Dev Support
Software
Development
Cost Planning
Automation
Platform
Eng / Ops
...
Maintenance
Network Ops
Capacity Planning
Confidential and Proprietary Information for Instana, Inc.
Planned Day
● 30 min Handoff Team AU
● 50% Tickets/QoS
● 50% Project work
● 30 min Handoff Team US
Confidential and Proprietary Information for Instana, Inc.
Actual Day
● Handoff Team AU
● Alerts
● Ping by Engineering
● Ping by SE / PM
● Ping by CS
● Less project work than planned
● Handoff Team US
Learn to say “No”
Confidential and Proprietary Information for Instana, Inc.
Communication is vital
"Something is broken"
Engineering:
"Okay, will have a look"
Sales / CS:
"OMG" => Escalation to CEO => Escalates to VP Eng.
Private Slack Channels
tech-*
Avoid Panic
Confidential and Proprietary Information for Instana, Inc.
SRE Team Priorities
● Quality of Service of SaaS platform
● Platform Security
● regularly security scans
● Project Work
● Multi Cloud (AWS & GCP)
● Cost Management
● Migrate platform to Kubernetes
● Upgrade Elasticsearch clusters
● Integrating new datastore (BeeInstant)
● Support On-Premises
● Developer Support
● Packaging and Delivery
Confidential and Proprietary Information for Instana, Inc.
Google SRE Book
Part II: Principles
Embracing Risk
Service Level Objectives
Eliminating Toil
Monitoring Distributed Systems
Release Engineering
Simplicity
Confidential and Proprietary Information for Instana, Inc.
Embracing Risk
● Redundancy / HA / failover
● Datastore clusters across AZ
● Horizontal scaleout of components
● Costs
● Cost per monitored host
● K8s / Nomad Orchestration bin-packing
● Managing TU resources
● Beta Phase for new features
● Test using internal units
● Beta customers
● Coming soon: Error Budgets
Confidential and Proprietary Information for Instana, Inc.
Service Level Indicators / Objectives
● Custom SLOs for all components in SaaS platform
● SLO configuration stored and versioned with backend code
● Updated via REST API after each release
● Identical across all regions
● Managed by Engineering and SRE
Confidential and Proprietary Information for Instana, Inc.
Eliminating Toil
"The moment you have to do something twice, think about automating it"
Spin up new VM Jenkins + Terraform
Setup / Expand datastore cluster Chef recipes
Deploy / Update components Jenkins + instanactl
Run migrations Jenkins + instanactl
Configure Jenkins Job Jenkins Job DSL (all jobs are generated)
Configure DNS instanactl / external-dns (a few DNS entries are manually
configured)
Setup GKE cluster gcloud
Setup EKS cluster eksctl
Confidential and Proprietary Information for Instana, Inc.
Monitoring Distributed Systems
We use Instana to monitor Instana
● Datastores (Cassandra, ClickHouse, CockroachDB, Elasticsearch, Kafka,
ZooKeeper, ...)
● Infrastructure Monitoring
● Java DropWizard
● NodeJS
● Automatic Distributed tracing
● Automatic End-User-Monitoring
● Built-in alerting
Feedback Loop with PM & Engineering
Confidential and Proprietary Information for Instana, Inc.
Release Engineering
● Bi-Weekly Major Releases (Consistency)
● Continuous Release of Beta Features & Improvements & Hotfixes (24 / 7)
● Rotating Release Engineer
● Knowledge Sharing / Release Engineer Playbook
● Rollut for new K8s environments fully automated
● instanactl <environment> upgrade
■ check preconditions
■ run migrations
■ upgrade shared and tenant unit containers
■ check postconditions
● Post Mortem after each release / incident
● Improve / automate / refactor processes
Confidential and Proprietary Information for Instana, Inc.
Simplicity, Simplicity,
Simplicity, ...
Confidential and Proprietary Information for Instana, Inc.
Automatic Complexity - Infrastructure
New Regions
Multi Cloud
Infrastructure
automatically becomes
more complex over time
due to growth and other
external factors.
Confidential and Proprietary Information for Instana, Inc.
System architecture
automatically becomes
more complex when new
features are added over
time.
Automatic Complexity - Product
New Features
New Datastores
New Components
Confidential and Proprietary Information for Instana, Inc.
Work Towards Simplicity
Infrastructure Design
Network Design
Plan your infrastructure and network design for
growth and simplicity. Keep the overall system
as simple possible and only as complex as really
needed. This will make your life a lot easier
during your typical work day. In times of crisis
(i.e. outages) a simple system is easier to
understand for all engineers involved to resolve
the issue at hand.
First 5 years
Next 5 years
Confidential and Proprietary Information for Instana, Inc.
Common Codebase (SaaS / On-Premises)
up to 2019
Each datastore its migration tool
● Cassandra (cassandra-migrator)
● ClickHouse (golang-migrate)
● Elasticsearch (http-client)
● Kafka (kafka-cli)
● MongoDB (mongo migrator)
○ replaced by CockroachDB
● PostgreSQL (flyway db)
○ replaced by CockroachDB
Runtimes: Ruby/Python/Java
2020
instanactl
● GoLang CLI
○ cobra library
○ golang-migrate library
● used by SaaS and On-Premises
● single place for migration scripts
Runtimes: Single GoLang Binary
Confidential and Proprietary Information for Instana, Inc.
Common Codebase (SaaS / On-Premises)
up to 2019
● separate configuration
● separate packaging (Docker / Packages)
○ SaaS: Docker
○ OnPrem: RPM / DEB
● separate delivery (Ansible / Chef)
Runtimes: Python / Ruby
2020
● same configuration
● same Docker images
● same migration tool
○ instanactl
Runtimes: GoLang Binary & Docker
Supported Operating Systems
Ubuntu 16.04, 18.04 Debian 8.x, 9.x, 10.x RedHat 7.2+ CentOS 7.x Amazon Linux 2.x
Confidential and Proprietary Information for Instana, Inc.
Lessons Learned
Confidential and Proprietary Information for Instana, Inc.
Learn to say “No” Reduce Complexity
Know Your Tools
Focus and Prioritize
Work
Keep Tooling to a
Bare Minimum
Communicate
Across Teams
Share Knowledge
(SRE runbook, screen recordings, blogs)
Confidential and Proprietary Information for Instana, Inc.
Q&A
Confidential and Proprietary Information for Instana, Inc.
w w w . i n s t a n a . c o m

More Related Content

What's hot

SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!New Relic
 
Instana - ClickHouse presentation
Instana - ClickHouse presentationInstana - ClickHouse presentation
Instana - ClickHouse presentationMiel Donkers
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceFranklin Angulo
 
Platform Engineering
Platform EngineeringPlatform Engineering
Platform EngineeringOpsta
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaScyllaDB
 
Platform engineering 101
Platform engineering 101Platform engineering 101
Platform engineering 101Sander Knape
 
OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)Sebastian Poxhofer
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)Hussain Mansoor
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Abeer R
 

What's hot (20)

Observability
ObservabilityObservability
Observability
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
Instana - ClickHouse presentation
Instana - ClickHouse presentationInstana - ClickHouse presentation
Instana - ClickHouse presentation
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
Platform Engineering
Platform EngineeringPlatform Engineering
Platform Engineering
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
 
Platform engineering 101
Platform engineering 101Platform engineering 101
Platform engineering 101
 
OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)OpenTelemetry: From front- to backend (2022)
OpenTelemetry: From front- to backend (2022)
 
The future of AIOps
The future of AIOpsThe future of AIOps
The future of AIOps
 
Observability & Datadog
Observability & DatadogObservability & Datadog
Observability & Datadog
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Shift left Observability
Shift left ObservabilityShift left Observability
Shift left Observability
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data Engineering Basics
Data Engineering BasicsData Engineering Basics
Data Engineering Basics
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
 
Introduction to Microservices
Introduction to MicroservicesIntroduction to Microservices
Introduction to Microservices
 

Similar to SRE Day in Life at Instana

Scaling up Near Real-time Analytics @Uber &LinkedIn
Scaling up Near Real-time Analytics @Uber &LinkedInScaling up Near Real-time Analytics @Uber &LinkedIn
Scaling up Near Real-time Analytics @Uber &LinkedInC4Media
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
SAMKUMAR- Sr.Linux SystemAdministrator (1)
SAMKUMAR- Sr.Linux SystemAdministrator (1)SAMKUMAR- Sr.Linux SystemAdministrator (1)
SAMKUMAR- Sr.Linux SystemAdministrator (1)gandi samkumar
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesDatabricks
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITOpenStack
 
Pipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as CodePipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as CodeKris Buytaert
 
Mukesh_Thakur_Wintel and VMware Admin
Mukesh_Thakur_Wintel and VMware AdminMukesh_Thakur_Wintel and VMware Admin
Mukesh_Thakur_Wintel and VMware AdminMukesh Thakur
 
Infrastructure as code
Infrastructure as codeInfrastructure as code
Infrastructure as codeNaseath Saly
 
6 Things You Need to Know to Safely Run Kubernetes
6 Things You Need to Know to Safely Run Kubernetes6 Things You Need to Know to Safely Run Kubernetes
6 Things You Need to Know to Safely Run KubernetesVMware Tanzu
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...StreamNative
 
Red Hat Summit - What are your digital foundations?
Red Hat Summit - What are your digital foundations?Red Hat Summit - What are your digital foundations?
Red Hat Summit - What are your digital foundations?Eric D. Schabell
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...confluent
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...Amazon Web Services
 
Using Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at SplunkUsing Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at SplunkDocker, Inc.
 

Similar to SRE Day in Life at Instana (20)

Scaling up Near Real-time Analytics @Uber &LinkedIn
Scaling up Near Real-time Analytics @Uber &LinkedInScaling up Near Real-time Analytics @Uber &LinkedIn
Scaling up Near Real-time Analytics @Uber &LinkedIn
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
SAMKUMAR- Sr.Linux SystemAdministrator (1)
SAMKUMAR- Sr.Linux SystemAdministrator (1)SAMKUMAR- Sr.Linux SystemAdministrator (1)
SAMKUMAR- Sr.Linux SystemAdministrator (1)
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on Kubernetes
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
 
Pipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as CodePipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as Code
 
Mukesh_Thakur_Wintel and VMware Admin
Mukesh_Thakur_Wintel and VMware AdminMukesh_Thakur_Wintel and VMware Admin
Mukesh_Thakur_Wintel and VMware Admin
 
Infrastructure as code
Infrastructure as codeInfrastructure as code
Infrastructure as code
 
6 Things You Need to Know to Safely Run Kubernetes
6 Things You Need to Know to Safely Run Kubernetes6 Things You Need to Know to Safely Run Kubernetes
6 Things You Need to Know to Safely Run Kubernetes
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
 
Red Hat Summit - What are your digital foundations?
Red Hat Summit - What are your digital foundations?Red Hat Summit - What are your digital foundations?
Red Hat Summit - What are your digital foundations?
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
 
Using Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at SplunkUsing Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at Splunk
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

SRE Day in Life at Instana

  • 1. Confidential and Proprietary Information for Instana, Inc. SRE: A day in the life … by Marcel Birkner
  • 2. Confidential and Proprietary Information for Instana, Inc. Bio Marcel Birkner works as a Staff Reliability Engineer at Instana, an Application Performance Monitoring (APM) solution. He has long experience in software engineering and software automation. Currently he focuses on migrating the existing stack to Kubernetes and reducing overall system complexity.
  • 3. Confidential and Proprietary Information for Instana, Inc. Abstract What does a typical day as an SRE look like? In this presentation I will discuss the challenges we face while running a SaaS platform that is used 24 / 7 / 365 around the globe. In doing so, we have embraced the core principles described in the Google SRE handbook. While we are not Google by any means, most of the principles apply to our daily work one way or another. Having a fully distributed team running a distributed system can be quite challenging. In this talk I’ll be covering: ● Core SRE principles ● How Instana has applied them to our daily work ● Lessons learned along the way
  • 4. Confidential and Proprietary Information for Instana, Inc. Who We Are
  • 5. Confidential and Proprietary Information for Instana, Inc.Confidential and Proprietary Information for Instana, Inc. SRE Team 1 Team 3 Time zones 24 / 7 / 365 support On-call rotation Members have operations and software engineering background Team US Team EU Team AU
  • 6. Confidential and Proprietary Information for Instana, Inc. What We Do
  • 7. Confidential and Proprietary Information for Instana, Inc. Application Tracing EUM Mobile App ServiceInfrastructure https://play-with.instana.io
  • 8. Confidential and Proprietary Information for Instana, Inc.Confidential and Proprietary Information for Instana, Inc. Stats 280 TB / Month Ingress 8 PB / Month Cross AZ Traffic 30K+ ECU 8 different datastore clusters / region 4K+ Containers Running in SaaS
  • 9. Confidential and Proprietary Information for Instana, Inc.Confidential and Proprietary Information for Instana, Inc. SaaS Regions Multi Cloud Strategy 2 x AWS regions 2 x GCP regions HashiCorp Nomad/Consul Kubernetes
  • 10. Confidential and Proprietary Information for Instana, Inc. How We Do It
  • 11. Confidential and Proprietary Information for Instana, Inc. SRE by the book OnCall Post Mortems SLI / SLO / SLA RCA Error Budgets Database Ops Dev Support Software Development Cost Planning Automation Platform Eng / Ops ... Maintenance Network Ops Capacity Planning
  • 12. Confidential and Proprietary Information for Instana, Inc. Planned Day ● 30 min Handoff Team AU ● 50% Tickets/QoS ● 50% Project work ● 30 min Handoff Team US
  • 13. Confidential and Proprietary Information for Instana, Inc. Actual Day ● Handoff Team AU ● Alerts ● Ping by Engineering ● Ping by SE / PM ● Ping by CS ● Less project work than planned ● Handoff Team US Learn to say “No”
  • 14. Confidential and Proprietary Information for Instana, Inc. Communication is vital "Something is broken" Engineering: "Okay, will have a look" Sales / CS: "OMG" => Escalation to CEO => Escalates to VP Eng. Private Slack Channels tech-* Avoid Panic
  • 15. Confidential and Proprietary Information for Instana, Inc. SRE Team Priorities ● Quality of Service of SaaS platform ● Platform Security ● regularly security scans ● Project Work ● Multi Cloud (AWS & GCP) ● Cost Management ● Migrate platform to Kubernetes ● Upgrade Elasticsearch clusters ● Integrating new datastore (BeeInstant) ● Support On-Premises ● Developer Support ● Packaging and Delivery
  • 16. Confidential and Proprietary Information for Instana, Inc. Google SRE Book Part II: Principles Embracing Risk Service Level Objectives Eliminating Toil Monitoring Distributed Systems Release Engineering Simplicity
  • 17. Confidential and Proprietary Information for Instana, Inc. Embracing Risk ● Redundancy / HA / failover ● Datastore clusters across AZ ● Horizontal scaleout of components ● Costs ● Cost per monitored host ● K8s / Nomad Orchestration bin-packing ● Managing TU resources ● Beta Phase for new features ● Test using internal units ● Beta customers ● Coming soon: Error Budgets
  • 18. Confidential and Proprietary Information for Instana, Inc. Service Level Indicators / Objectives ● Custom SLOs for all components in SaaS platform ● SLO configuration stored and versioned with backend code ● Updated via REST API after each release ● Identical across all regions ● Managed by Engineering and SRE
  • 19. Confidential and Proprietary Information for Instana, Inc. Eliminating Toil "The moment you have to do something twice, think about automating it" Spin up new VM Jenkins + Terraform Setup / Expand datastore cluster Chef recipes Deploy / Update components Jenkins + instanactl Run migrations Jenkins + instanactl Configure Jenkins Job Jenkins Job DSL (all jobs are generated) Configure DNS instanactl / external-dns (a few DNS entries are manually configured) Setup GKE cluster gcloud Setup EKS cluster eksctl
  • 20. Confidential and Proprietary Information for Instana, Inc. Monitoring Distributed Systems We use Instana to monitor Instana ● Datastores (Cassandra, ClickHouse, CockroachDB, Elasticsearch, Kafka, ZooKeeper, ...) ● Infrastructure Monitoring ● Java DropWizard ● NodeJS ● Automatic Distributed tracing ● Automatic End-User-Monitoring ● Built-in alerting Feedback Loop with PM & Engineering
  • 21. Confidential and Proprietary Information for Instana, Inc. Release Engineering ● Bi-Weekly Major Releases (Consistency) ● Continuous Release of Beta Features & Improvements & Hotfixes (24 / 7) ● Rotating Release Engineer ● Knowledge Sharing / Release Engineer Playbook ● Rollut for new K8s environments fully automated ● instanactl <environment> upgrade ■ check preconditions ■ run migrations ■ upgrade shared and tenant unit containers ■ check postconditions ● Post Mortem after each release / incident ● Improve / automate / refactor processes
  • 22. Confidential and Proprietary Information for Instana, Inc. Simplicity, Simplicity, Simplicity, ...
  • 23. Confidential and Proprietary Information for Instana, Inc. Automatic Complexity - Infrastructure New Regions Multi Cloud Infrastructure automatically becomes more complex over time due to growth and other external factors.
  • 24. Confidential and Proprietary Information for Instana, Inc. System architecture automatically becomes more complex when new features are added over time. Automatic Complexity - Product New Features New Datastores New Components
  • 25. Confidential and Proprietary Information for Instana, Inc. Work Towards Simplicity Infrastructure Design Network Design Plan your infrastructure and network design for growth and simplicity. Keep the overall system as simple possible and only as complex as really needed. This will make your life a lot easier during your typical work day. In times of crisis (i.e. outages) a simple system is easier to understand for all engineers involved to resolve the issue at hand. First 5 years Next 5 years
  • 26. Confidential and Proprietary Information for Instana, Inc. Common Codebase (SaaS / On-Premises) up to 2019 Each datastore its migration tool ● Cassandra (cassandra-migrator) ● ClickHouse (golang-migrate) ● Elasticsearch (http-client) ● Kafka (kafka-cli) ● MongoDB (mongo migrator) ○ replaced by CockroachDB ● PostgreSQL (flyway db) ○ replaced by CockroachDB Runtimes: Ruby/Python/Java 2020 instanactl ● GoLang CLI ○ cobra library ○ golang-migrate library ● used by SaaS and On-Premises ● single place for migration scripts Runtimes: Single GoLang Binary
  • 27. Confidential and Proprietary Information for Instana, Inc. Common Codebase (SaaS / On-Premises) up to 2019 ● separate configuration ● separate packaging (Docker / Packages) ○ SaaS: Docker ○ OnPrem: RPM / DEB ● separate delivery (Ansible / Chef) Runtimes: Python / Ruby 2020 ● same configuration ● same Docker images ● same migration tool ○ instanactl Runtimes: GoLang Binary & Docker Supported Operating Systems Ubuntu 16.04, 18.04 Debian 8.x, 9.x, 10.x RedHat 7.2+ CentOS 7.x Amazon Linux 2.x
  • 28. Confidential and Proprietary Information for Instana, Inc. Lessons Learned
  • 29. Confidential and Proprietary Information for Instana, Inc. Learn to say “No” Reduce Complexity Know Your Tools Focus and Prioritize Work Keep Tooling to a Bare Minimum Communicate Across Teams Share Knowledge (SRE runbook, screen recordings, blogs)
  • 30. Confidential and Proprietary Information for Instana, Inc. Q&A
  • 31. Confidential and Proprietary Information for Instana, Inc. w w w . i n s t a n a . c o m