Confidential / © Harness Inc. 2020
Bringing order to Chaos
Make your systems more resilient
with Chaos Engineering
Sayan Mondal
@s_ayanide
@s-ayanide
● Senior Software Engineer II at Harness
● Maintainer of LitmusChaos (a CNCF incubating project) for 2 years, contributing for 3.5 years
● Volunteer and mentor at the Linux Foundation
● Chaos Carnival organizing team member
What causes Downtime
● Application Failures
● Infrastructure Failures
● Operational Failures
Impact of downtime:
● Reputational Impact: Slack’s outages
● Financial Impact: est. >$55M in losses to WF
● Poor User Experience: 75,000+ passengers’ travel plans impacted
The cloud native problem
Proliferation of applications
into microservices leads to a
RELIABILITY challenge
In cloud native, your code depends on hundreds
of other microservices and runs on many
platforms. The chance of being hit by the failure
of a dependent component is high.
1 Your Application
2 Your Application’s Dependencies: MongoDB, Kafka, TiKV, Vitess, Postgres, etc.
3 Cloud Native Services: CoreDNS, Envoy, Prometheus, OpenEBS, etc.
4 Kubernetes Services
5 Platform Services, Infrastructure
Problems with existing solutions
Not automated
Not collaborative
Reactive Approach
● No proactive investments for
failure testing
● Generally driven by root
cause analysis
● Driven by Ops
● Not integrated into CI/CD or
Gamedays
The solution? Chaos Engineering
SREs + Developers
Experiments are in Git just like code
Chaos engineering is collaborative
Collaborative chaos experiments in
a centralized control plane
Optimize initial investment
Reduce the inertia for starting chaos
Robust Experiments
Public and private chaos hubs with
ready-to-use experiments
Find weaknesses during build/test phase
Verifying at dev stage saves money
Integrate into CI/CD systems
Roll out automated and controlled
chaos experiments across
prod/non-prod environments
Measure the impact of inducing chaos
Build confidence by starting small
Enables observability for Chaos
Chaos metrics used to assess
impact and manage SLOs/Errors
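As a rough illustration of using chaos metrics to manage SLOs, the sketch below computes an error budget for an availability target and the share of it that a chaos run consumed. The function names, the 30-day window, and the example numbers are assumptions made for illustration; they are not Litmus APIs or figures from the talk.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_consumed(downtime_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget used up by the observed downtime."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO and 5 minutes of impact during chaos.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"consumed: {budget_consumed(5.0, 0.999):.1%}")
```

Starting small, as the slide suggests, corresponds to keeping the consumed fraction low while confidence builds.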
How to do Chaos Engineering?
Why Resilience? Achieving Resilience Litmus 101 Litmus Experiments Contributing
Project LitmusChaos, a CNCF incubating project
Litmus is an open source platform for practicing chaos engineering in a cloud native way.
● Started in 2017; 4+ years of active development
● 350K+ Litmus installations; 30x growth in per-day installations in the last 3 quarters; 1500 installations per day
● 50+ chaos experiments, 100+ contributors
● Stable platform: 2.0 released
● 50+ enterprises using 2.0
What’s Exciting about LitmusChaos?
● Chaos Workflow editor
● User Management & Teaming
● ChaosCenter
● ChaosAgent
● Custom Image Registry
● GitOps
● Monitoring and Observability
● Resilience Score Calculation
● ChaosHub
● Scheduling
● More Control over Chaos Results
● New & Enhanced Experiments
● Support for CRI-O, containerd and Docker
Architecture Overview of LitmusChaos 2.0
Chaos Workflow Deep Dive
Core Components of LitmusChaos
ChaosCenter
The ChaosCenter is the single source of truth for controlling all the chaos activities happening around Litmus. From the ChaosCenter you can manage every part of Litmus and shape your workflows exactly the way you want them.
ChaosAgent
A ChaosAgent in Litmus is the target cluster where chaos is injected via Litmus. At least one ChaosAgent should always be connected to the ChaosCenter. Any individual ChaosAgent can be chosen as the target agent for chaos injection.
Variety of faults offered in LitmusChaos
Pod Chaos
● Pod Failure
● Container Kill
● Pod Autoscale
Node Chaos
● Node Drain
● Forced Eviction (Node Taints)
● Node Restart/PowerOff
Network Chaos
● Network Latency
● Packet Loss
● Network Corruption, Duplication
Stress Chaos
● Pod, Node CPU Hog
● Pod, Node Memory Hog
● Pod, Node Disk Stress
● Pod Ephemeral Storage Fill
Cloud Services
● AWS EKS EC2 Termination
● AWS EBS Disk Detach
● GCP GPD Disk Detach
Application Chaos
● Kafka Leader Broker Failure
● Cassandra Ring Disruption
● OpenEBS Control Plane / Volume Failure
The Features
Let’s take a look at the core
features offered
Chaos Center
Chaos Workflows
Automate dependency setup,
create complex chaos
scenarios, support definition of
load/validation jobs along with
chaos injection
Multiple options
From Templates, Custom
Workflows from Scratch (using
ChaosHubs), From pre-created
YAMLs
Control
Chaos Experiments Sequence
Control (Parallel as well as
Sequential steps creation)
Schedules
Creation of either Singular or
Cron Workflows as Schedules
Experiment Priority
Attaching priority to Chaos
Experiments based on your use
cases
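The sequence control described above (sequential stages, each holding one or more parallel steps) can be sketched as a list of stages. This is only an illustration of the ordering idea: `run_workflow` and `make_fault` are hypothetical helpers, the fault names are used purely as labels, and nothing is actually injected.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Fault = Callable[[], str]


def run_workflow(stages: List[List[Fault]]) -> List[List[str]]:
    """Run stages one after another; faults inside one stage run in parallel."""
    results: List[List[str]] = []
    for stage in stages:
        # A stage finishes only when every fault in it has returned.
        with ThreadPoolExecutor(max_workers=max(1, len(stage))) as pool:
            futures = [pool.submit(fault) for fault in stage]
            results.append([future.result() for future in futures])
    return results


def make_fault(name: str) -> Fault:
    """Hypothetical stand-in for a chaos step; it only returns a status string."""
    return lambda: f"{name}: done"


if __name__ == "__main__":
    workflow = [
        [make_fault("pod-delete")],                                      # stage 1
        [make_fault("pod-cpu-hog"), make_fault("pod-network-latency")],  # stage 2, parallel
    ]
    for number, stage_result in enumerate(run_workflow(workflow), start=1):
        print(number, stage_result)
```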
Workflow Management
GitOps
Rolling out automated changes
using GitOps
Custom Image
Allowing image addition from
custom image server (both
public and private)
Resilience Score
Measure and Analyse the
Resilience Score of each
workflow
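One way to picture a per-workflow resilience score is as a weighted average of each experiment's outcome, with the weights playing the role of the experiment priority mentioned on the previous slide. The formula, the field names, and the sample values below are assumptions for illustration; the Litmus documentation is the reference for the exact calculation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ExperimentResult:
    name: str
    weight: int                # assumed priority assigned to the experiment
    probe_success_pct: float   # assumed outcome in the range 0 to 100


def resilience_score(results: Iterable[ExperimentResult]) -> float:
    """Weighted average of experiment outcomes, as a percentage."""
    items = list(results)
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        return 0.0  # no weighted experiments, so nothing to score
    weighted = sum(item.weight * item.probe_success_pct for item in items)
    return weighted / total_weight


if __name__ == "__main__":
    run = [
        ExperimentResult("pod-delete", weight=10, probe_success_pct=100.0),
        ExperimentResult("pod-cpu-hog", weight=5, probe_success_pct=50.0),
    ]
    print(f"resilience score: {resilience_score(run):.1f}%")
```

Under this assumption, a high-priority experiment that fails pulls the score down more than a low-priority one.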
Multi Tenancy
Scope Support
Supports setup (control plane &
agents) and execution of chaos
experiments in both cluster
scoped and namespace scoped
modes.
Authentication
Authentication and a smooth
onboarding process. Choose
between email and password
auth or OAuth with Google or
GitHub for your teams.
Create Teams
Creating a Team of multiple
Users and Project Management
Fine-Grained RBAC
Flexible RBAC to drill down
and grant correct privileges
to users.
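Fine-grained RBAC can be pictured as a mapping from roles to permitted actions. The role names and action strings in this sketch are assumptions chosen for illustration; they are not taken from the slides or from the Litmus API.

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    EDITOR = "editor"
    OWNER = "owner"


# Hypothetical permission table: each role lists the actions it may perform.
PERMISSIONS = {
    Role.VIEWER: {"view_workflow"},
    Role.EDITOR: {"view_workflow", "run_workflow", "edit_workflow"},
    Role.OWNER: {"view_workflow", "run_workflow", "edit_workflow", "manage_members"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True when the role's permission set contains the action."""
    return action in PERMISSIONS.get(role, set())


if __name__ == "__main__":
    print(is_allowed(Role.EDITOR, "run_workflow"))
    print(is_allowed(Role.VIEWER, "run_workflow"))
```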
Monitoring & Observability
Connect Datasource
Connect a Data Source (from
any Agent) and monitor
workflows
Visualization
Visualize workflow run statistics
and aggregated schedules
Comparison
Compare two or more
Workflows
Upload Dashboards
Upload shared/downloadable
dashboards available in the
community
Tune Dashboards
Edit queries, Tune dashboards
to create a custom one from
scratch
Monitor in Real Time
Monitor effect of chaos in real
time with interleaved events
and metrics from Prometheus
Datasource
GitOps for Chaos
Git based SCM
Integrates with Git-based SCM
to provide a single source of
truth for chaos artifacts
(workflows). Changes are
synchronized bi-directionally
between the Git source and the
ChaosCenter, so the latest
artifact is pulled for execution.
Tracking
Provides an event-tracker
microservice that automatically
launches “subscribed” chaos
workflows upon app upgrades
carried out by GitOps tools such as
ArgoCD and Flux
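The tracking idea (launch subscribed workflows when an application is upgraded) can be sketched as a comparison of image versions before and after a sync. This is a simplified illustration, not the event-tracker's actual logic: the function name, the dictionaries, and the image tags are all hypothetical.

```python
from typing import Dict, List


def workflows_to_trigger(
    previous_images: Dict[str, str],
    current_images: Dict[str, str],
    subscriptions: Dict[str, List[str]],
) -> List[str]:
    """Return workflows subscribed to apps whose image changed.

    An app that was not seen before is treated as changed.
    """
    triggered: List[str] = []
    for app, image in current_images.items():
        if previous_images.get(app) != image:
            for workflow in subscriptions.get(app, []):
                if workflow not in triggered:  # avoid launching a workflow twice
                    triggered.append(workflow)
    return triggered


if __name__ == "__main__":
    before = {"cart": "cart:1.0", "catalog": "catalog:2.3"}
    after = {"cart": "cart:1.1", "catalog": "catalog:2.3"}
    subs = {"cart": ["cart-pod-delete"], "catalog": ["catalog-latency"]}
    print(workflows_to_trigger(before, after, subs))
```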
Non Kubernetes Chaos
Chaos on Infra
Inject chaos on infrastructure
resources such as
VMs/instances and disks (AWS,
GCP, Azure, VMware)
Attack Bare Metal
Introduces chaos experiments
to bring down bare-metal nodes
that provide IPMI-based
out-of-band access.
Chaos on Machine
Litmus has developed m-agent:
a platform-generic daemon
agent for orchestrating chaos
on any computing node.
Hands On
Demo Time
Install the components
Pick the faults
Inject Chaos in your application
Observe Impact
Multi-Cloud Support with LitmusChaos 2.0
Seamless support for cross-cloud connectivity
and interactions. Target your applications running
on your preferred cloud provider with litmusctl.
Future Roadmap: What’s ahead for us?
● Increased support for chaos against non-Kubernetes infrastructure components
● More application-specific chaos experiments with native faults and health checks
● Improved Chaos SDK for creation of user-defined experiments
● Additional probe types for diverse steady-state hypothesis validation
● Improved Observability for chaos experiments
● More community supported Chaos Types
How to Contribute?
We welcome contributions of all kinds
● Development of features, bug fixes, and
other improvements.
● Documentation including reference
material and examples.
● Bug and feature reports
You can choose from a list of sub-dependent
repos to contribute to. A few highlighted repos
that Litmus uses are:
● Chaos-charts
● Chaos-workflows
● Test-tools
● Litmus UI
● Litmus-go
● website-litmuschaos
Conclusion
● Chaos Engineering
● A tool called LitmusChaos
● Architecture principle
● Core Components of Chaos Induction
● Demo
Thank You
Follow Litmus on: /LitmusChaos, /litmuschaos
Contact me on: @s_ayanide, /s-ayanide

stackconf 2023 | Bringing Order to Chaos: Make Your Systems More Resilient with Chaos Engineering by Sayan Mondal.pdf

  • 1. Confidential / © Harness Inc. 2020 Bringing order to Chaos Make your systems more resilient with Chaos Engineering
  • 2. Confidential / © Harness Inc. 2021 Sayan Mondal Senior Software Engineer II at Maintainer at @s_ayanide @s-ayanide ● Senior Software Engineer 2 at Harness ● Maintainer of LitmusChaos (CNCF incubating project) for 2 years, contributing since 3.5 years ● Volunteer and mentor at Linux Foundation ● Chaos Carnival organizing team member
  • 3. Confidential / © Harness 2022 What causes Downtime Application Failures Reputational Impact Financial Impact Poor User Experience Slack’s Outages Est. >$55M in losses to WF 75,000+ passengers travel plans impacted Infrastructure Failures Operational Failures
  • 4. Confidential / © Harness Inc. 2021 The cloud native problem Proliferation of applications into micro services leads to a RELIABILITY challenge In cloud native, your code depends on hundreds of other microservices and runs on many platforms. The potential of being subjected to a dependent component failure is huge. 1 Your Application 3 Cloud Native Services CoreDNS, Envoy, Prometheus, OpenEBS, etc. 5 Platform Services Infrastructure 2 Your Application’s Dependencies MongoDB, Kafka, TiKV, Vitess, Postgres, etc. 4 Kubernetes Services
  • 5. Confidential / © Harness Inc. 2021 Problems with existing solutions Not automated Not collaborative Reactive Approach ● No proactive investments for failure testing ● Generally driven by root cause analysis ● No proactive investments for failure testing ● Generally driven by root cause analysis ● Driven by Ops ● Not integrated into CI/CD or Gamedays
  • 6. Confidential / © Harness Inc. 2021 The solution? Chaos Engineering SREs + Developers Experiments are in Git just like code Chaos engineering is collaborative Collaborative chaos experiments in a centralized control plane Optimize initial investment Reduce the inertia for starting chaos Robust Experiments Public and private chaos hubs with ready to use experiments Find weaknesses during build/test phase Verifying at dev stage saves money Integrate into CI/CD systems Rollout automated and controlled chaos experiments across prod/non-prod environments Measure the impact of inducing chaos Build confidence by starting small Enables observability for Chaos Chaos metrics used to assess impact and manage SLOs/Errors
  • 7. How to do Chaos Engineering? ● Why Resilience? ● Achieving Resilience ● Litmus 101 ● Litmus Experiments ● Contributing
  • 8. Project LitmusChaos, a CNCF incubating project: Litmus is an open source platform for practicing chaos engineering in a cloud native way. ● Started in 2017; 4+ years of active development ● 350K+ Litmus installations ● 30x growth in per-day installations of Litmus in the last 3 quarters; 1,500 installations per day ● 50+ chaos experiments ● 100+ contributors ● Stable platform: 2.0 released ● 50+ enterprises using 2.0
  • 9. What’s exciting about LitmusChaos? Highlights across the ChaosCenter and the ChaosAgent: ● Chaos workflow editor ● User management & teaming ● Custom image registry ● GitOps ● Monitoring and observability ● Resilience score calculation ● ChaosHub ● Scheduling ● More control over chaos results ● New & enhanced experiments ● Support for CRI-O, containerd and Docker
  • 10. Architecture Overview of LitmusChaos 2.0
  • 11. Chaos Workflow Deep Dive
  • 12. Core Components of LitmusChaos ● ChaosCenter: the ChaosCenter is a single source of truth for controlling all the chaos activities happening around Litmus. From the ChaosCenter you can manage every part of Litmus and shape your workflows exactly the way you want them. ● ChaosAgent: a ChaosAgent in Litmus is the target cluster where chaos is injected via Litmus. At least one ChaosAgent should always be connected to the ChaosCenter, and any individual ChaosAgent can be chosen as the target agent for chaos injection.
  • 13. Variety of faults offered in LitmusChaos ● Pod chaos: pod failure, container kill, pod autoscale ● Node chaos: node drain, forced eviction (node taints), node restart/power-off ● Network chaos: network latency, packet loss, network corruption, duplication ● Stress chaos: pod and node CPU hog, pod and node memory hog, pod and node disk stress, pod ephemeral storage fill ● Cloud services: AWS EKS EC2 termination, AWS EBS disk detach, GCP GPD disk detach ● Application chaos: Kafka leader broker failure, Cassandra ring disruption, OpenEBS control plane / volume failure
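As a rough illustration of how one fault from this catalog might be declared, the sketch below builds a ChaosEngine manifest for a pod-level fault as a plain Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The field names follow the `litmuschaos.io/v1alpha1` ChaosEngine resource as commonly documented, but they should be checked against the CRD version you have installed. The helper name `pod_delete_engine`, the engine name, the service account name and the target labels are all hypothetical.

```python
import json


def pod_delete_engine(app_ns, app_label, duration_s=30):
    """Build a ChaosEngine manifest (as a dict) for a pod-delete fault.

    Assumption: field names mirror the litmuschaos.io/v1alpha1 ChaosEngine
    resource; verify them against your installed CRD before applying.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # Hypothetical engine name, chosen for this example only.
        "metadata": {"name": "demo-pod-delete", "namespace": app_ns},
        "spec": {
            "engineState": "active",
            # The application under test, selected by namespace and label.
            "appinfo": {
                "appns": app_ns,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Hypothetical service account with permissions for the fault.
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Total time (seconds) to keep injecting chaos.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                # Gap (seconds) between successive pod deletions.
                                {"name": "CHAOS_INTERVAL", "value": "10"},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_delete_engine("default", "app=nginx"), indent=2))
```

Generating the manifest from code rather than hand-editing YAML is only one option; it is used here because it keeps the example self-contained and easy to parameterize per target application.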
  • 14. The Features: let’s take a look at the core features offered
  • 15. ChaosCenter ● Chaos workflows: automate dependency setup, create complex chaos scenarios, and define load/validation jobs alongside chaos injection ● Multiple options: from templates, custom workflows from scratch (using ChaosHubs), or from pre-created YAMLs ● Control chaos: experiment sequence control (creation of parallel as well as sequential steps) ● Schedules: creation of either singular or cron workflows as schedules ● Experiment priority: attach a priority to chaos experiments based on your use cases
  • 16. Workflow Management ● GitOps: rolling out automated changes using GitOps ● Custom image: allows images to be added from a custom image server (both public and private) ● Resilience score: measure and analyse the resilience score of each workflow
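The slide does not say how the resilience score is computed. A minimal sketch, assuming the weighted-average scheme described in the Litmus documentation (each experiment carries a weight, and the score is the weight-averaged probe success percentage), might look like this. The class and function names are made up for the example, and the formula should be treated as an assumption rather than the exact ChaosCenter implementation.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    name: str
    weight: int               # priority given to the experiment (assumed 1-10)
    probe_success_pct: float  # share of probes that passed, 0-100


def resilience_score(results):
    """Weighted average of probe success percentages across experiments.

    Assumption: score = sum(weight * probe_success_pct) / sum(weight).
    """
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one experiment with a non-zero weight is required")
    weighted_sum = sum(r.weight * r.probe_success_pct for r in results)
    return weighted_sum / total_weight


if __name__ == "__main__":
    run = [
        ExperimentResult("pod-delete", weight=3, probe_success_pct=100.0),
        ExperimentResult("node-drain", weight=8, probe_success_pct=50.0),
    ]
    print(f"Resilience score: {resilience_score(run):.1f}%")
```

Under this scheme a failure in a heavily weighted experiment pulls the score down more than the same failure in a low-weight one, which is how experiment priority feeds into the score.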
  • 17. Multi Tenancy ● Scope support: supports setup (control plane and agents) and execution of chaos experiments in both cluster-scoped and namespace-scoped modes ● Authentication: authentication and a smooth onboarding process; choose between email and password auth or OAuth with Google or GitHub for your teams ● Create teams: create a team of multiple users and manage projects ● Fine-grained RBAC: flexible RBAC to drill down and grant the correct privileges to users
  • 18. Monitoring & Observability ● Connect datasource: connect a data source (from any agent) and monitor workflows ● Visualization: visualize workflow run statistics and aggregated schedules ● Comparison: compare two or more workflows ● Upload dashboards: upload shared/downloadable dashboards available in the community ● Tune dashboards: edit queries and tune dashboards, or create a custom one from scratch ● Monitor in real time: monitor the effect of chaos in real time with interleaved events and metrics from a Prometheus data source
  • 19. GitOps for Chaos ● Git-based SCM: integrates with Git-based SCM to provide a single source of truth for chaos artifacts (workflows). Changes are synchronized bi-directionally between the Git source and the ChaosCenter, so the latest artifact is always pulled for execution. ● Tracking: provides an event-tracker microservice that automatically launches “subscribed” chaos workflows upon app upgrades effected by GitOps tools like ArgoCD and Flux
  • 20. Non-Kubernetes Chaos ● Chaos on infra: inject chaos on infrastructure resources such as VMs/instances and disks (AWS, GCP, Azure, VMware) ● Attack bare metal: introduces chaos experiments that bring down bare-metal nodes which provide IPMI-based out-of-band access ● Chaos on machine: Litmus has developed m-agent, a platform-generic daemon agent for orchestrating chaos on any computing node
  • 21. Hands-On Demo Time
  • 22. Install the components
  • 23. Pick the faults
  • 24. Inject Chaos in your application
  • 25. Observe Impact
  • 26. Multi-Cloud Support with LitmusChaos 2.0: seamless support for cross-cloud connectivity and interactions. Target your applications running on your preferred cloud provider with litmusctl.
  • 27. Future Roadmap: what’s ahead for us? ● Increased support for chaos against non-Kubernetes infrastructure components ● More application-specific chaos experiments with native faults and health checks ● Improved Chaos SDK for creating user-defined experiments ● Additional probe types for diverse steady-state hypothesis validation ● Improved observability for chaos experiments ● More community-supported chaos types
  • 28. How to Contribute? We welcome contributions of all kinds: ● Development of features, bug fixes, and other improvements ● Documentation, including reference material and examples ● Bug and feature reports. You can choose from a list of sub-dependent repos to contribute to. A few highlighted repos that Litmus uses are: ● Chaos-charts ● Chaos-workflows ● Test-tools ● Litmus UI ● Litmus-go ● website-litmuschaos
  • 29. Conclusion ● Chaos Engineering ● A tool called LitmusChaos ● Architecture principle ● Core Components of Chaos Induction ● Demo
  • 30. Thank You ● Follow Litmus on: /LitmusChaos, /litmuschaos ● Contact me on: @s_ayanide, /s-ayanide