The Top Outages of 2023: Analyses and Takeaways
Featured speakers
Brian Tobia
Technical Marketing Engineer
© 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
Before We Begin…
• If you have any questions, please type them in the Questions window.
• If you have any audio problems, please message us in the chat for help.
• A recording of this presentation will be sent to you in a few days.
• Interested in more outage analysis and Internet insights? Check out the ThousandEyes
blog and The Internet Report podcast.
Anatomy of an Outage
• Understanding different types of
Internet outages is important to
mitigate their impact.
• Outages can vary in blast radius, be
planned or unplanned, and have
varying MTTR.
• Network outages depend on where
the problem occurs, with transit
network incidents impacting multiple
providers.
• Tracking outages can help teams
identify patterns and prevent
customer service disruptions.
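To make the tracking idea concrete, here is a minimal sketch of how a team might record incidents and compute MTTR, read here as mean time to restore (one common expansion of the acronym). The `Outage` class, its field names, and the sample incidents are all hypothetical and are not taken from ThousandEyes data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One tracked incident. All field names here are hypothetical."""
    name: str
    planned: bool
    start: datetime
    restored: datetime

    @property
    def duration(self) -> timedelta:
        return self.restored - self.start


def mean_time_to_restore(outages):
    """Average incident duration; raises ValueError on an empty list."""
    if not outages:
        raise ValueError("need at least one outage")
    total = sum((o.duration for o in outages), timedelta())
    return total / len(outages)


# Made-up sample incidents, for illustration only.
sample = [
    Outage("incident-a", False, datetime(2023, 1, 1, 7, 0), datetime(2023, 1, 1, 8, 30)),
    Outage("incident-b", True, datetime(2023, 2, 1, 3, 0), datetime(2023, 2, 1, 5, 30)),
]
mttr = mean_time_to_restore(sample)
print(mttr)
```

Keeping the `planned` flag on each record lets the same list be split into planned and unplanned incidents before averaging.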
Outage and Degradation Impacts
[Figure: services and dependencies (BGP, ISP, CDN, DNS, SaaS apps, services, APIs, data center, cloud, DDoS protection, SSE) mapped to the questions teams ask during an outage or degradation. Values shown in the figure without recoverable labels: $32,000, $120,000, $3,500, 3474.]
• Risk and compliance: Is our traffic getting routed out of region?
• Service availability: Which cloud regions are impacted?
• Situational awareness: Are regional ISPs spoofing our DNS records?
• Service recovery: Did we successfully cut over to our DDoS mitigation service?
• Network security: Are SASE routing policies working as we expect?
• Customer support: Is an Internet outage preventing users from reaching our service?
• Workforce productivity: Will our Salesforce dev updates degrade performance for some global users?
• Revenue protection: Is the payment gateway down or just unreachable?
2023 Outages by the Numbers: ISP Compared to CSP
• ThousandEyes reported an increase in cloud service provider (CSP) outages in 2023.
• CSP outages are the second most common type of disruption after ISP outages.
• The ratio of CSP outages to ISP outages increased in 2023.
2023 Outages by the Numbers:
U.S.-centric Outages in Relation to Global Outages
• U.S.-centric outages increased to 37% in 2023 from 34% in 2022.
• Smaller, contained outages are becoming more common.
• Localized outages have different impacts and require different responses compared to global outages.
2023 Outages by the Numbers: Application Outages
• The number and frequency of application outages have been on the rise over the past year.
• Application-related disruptions can have a bigger impact than network outages, though they are not as common.
Connections Are Complex
[Figure: end-to-end connectivity diagram. People, places, and things (branch offices, employees, BYOD and corporate devices, IoT cameras and sensors, VDI) connect through the edge and access networks (wireless networks and gateways, mobile networks, core networks, peering, DNS, ISP transit providers) to cloud and SaaS (cloud providers, data center infrastructure, cloud connectivity, direct connect, SaaS onramp).]
Correlate Performance Across Every Layer
[Figure: time-correlated performance views across each layer.]
2023 Outage Timeline
• Microsoft (1/25)
• Outlook (2/7)
• Virgin Media (4/4)
• AWS (6/13)
• Slack (8/2)
• Square (9/8)
• Workday + Cloudflare (11/2)
Color key: purple = application outage, red = network outage, blue = infrastructure outage.
Bookmark the Internet Outages Timeline for outage updates throughout the year.
• #1 Microsoft 365 (Jan 25, 2023), ~90 minutes: network issues due to BGP changes
• #2 Microsoft Outlook (Feb 7, 2023), ~2 hours: service unavailable/application errors
• #3 Virgin Media (Apr 4, 2023), ~7 hours: network traffic loss/BGP route withdrawal
• #4 AWS (Jun 13, 2023), ~2 hours: latency, server timeouts, and HTTP errors
• #5 Slack (Aug 2, 2023), ~2 hours: unable to send/receive messages
• #6 Square (Sept 8, 2023), ~12 hours: app errors and backend transactions failing
• #7 Workday + Cloudflare (Nov 2, 2023), ~36 hours: application and service outages
Microsoft 365 (1/25/23)
• Microsoft started experiencing service-related issues around 07:05 UTC.
• The disruption was triggered by an external BGP change by Microsoft that impacted connected service providers.
• Microsoft BGP prefixes were withdrawn completely but then almost immediately re-advertised.
• The change affected both smaller (/24) prefixes and summary (/12) prefixes.
• The cascading impact on global routing tables caused significant churn.
• Prefixes were either withdrawn or re-advertised to transit providers.
• A large amount of packet loss was seen, as well as HTTP and DNS timeouts.
• Timeouts were seen in the application “Response,” further indicating the effect of the network on service availability.
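The relationship between a specific /24 and a covering /12 summary can be sketched with a toy longest-prefix-match lookup. This is only an illustration of the routing concept, not a model of Microsoft's network: the prefixes below are private-range addresses chosen for the example, and `best_route` is a hypothetical helper built on Python's standard `ipaddress` module.

```python
import ipaddress


def best_route(table, address):
    """Return the most specific prefix in `table` covering `address`, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


# Private-range prefixes chosen for illustration; not real Microsoft prefixes.
summary = ipaddress.ip_network("10.0.0.0/12")
specific = ipaddress.ip_network("10.1.2.0/24")

table = {summary, specific}
before = best_route(table, "10.1.2.10")  # the /24 wins while it is advertised

table.discard(specific)
after_specific_withdrawn = best_route(table, "10.1.2.10")  # falls back to the /12

table.discard(summary)
after_all_withdrawn = best_route(table, "10.1.2.10")  # no covering route remains

print(before, after_specific_withdrawn, after_all_withdrawn)
```

While the summary is still advertised, traffic to the /24 still has a covering route. Once both are withdrawn, the destination is unreachable until a prefix is re-advertised, which is consistent with the packet loss and timeouts described above.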
Microsoft Outlook (2/7/23)
• Starting around 03:55 UTC,
Outlook became unavailable.
• Network path was working
properly, but ThousandEyes
observed elevated server
response timeouts and slow
page loading.
• Majority of the errors were
HTTP server timeouts,
indicating an application issue.
• Incident was mostly
concentrated in the U.S. and
lasted ~2 hours.
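The reasoning on this slide (clean network path plus server timeouts points to the application) can be expressed as a small triage rule. This is a rough sketch under stated assumptions: the `Probe` fields, the 5% loss threshold, and the labels are hypothetical, and they are not ThousandEyes metrics or API names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    """One synthetic test result. Field names are hypothetical."""
    packet_loss_pct: float
    http_status: Optional[int]  # None means the HTTP request timed out


def classify(probe, loss_threshold=5.0):
    """Label a probe 'network', 'application', or 'healthy'.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    if probe.packet_loss_pct >= loss_threshold:
        return "network"
    if probe.http_status is None or probe.http_status >= 500:
        return "application"
    return "healthy"


# Clean path but the server timed out: consistent with an application issue.
print(classify(Probe(packet_loss_pct=0.0, http_status=None)))
```

Checking the network layer first means an HTTP timeout is attributed to the application only once the path itself looks clean.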
Virgin Media (4/4/23)
• From approximately 00:30 to 17:30 UTC,
two outages impacted the reachability of
Virgin Media UK’s network and services.
• The first incident began at approximately
00:30 UTC and appeared to coincide with
a series of BGP route withdrawals.
• The second incident was shorter, but the networks experienced similar BGP and reachability issues.
• The outages occurred overnight and, given their repeat nature, could indicate maintenance issues.
AWS (6/13/23)
• The outage impacted services within US-EAST-1.
• It lasted two hours, during which increased latency, server timeouts, and HTTP server errors were observed.
• AWS console access was also affected,
making troubleshooting difficult.
• AWS confirmed the issue was due to a
capacity management subsystem failure.
• Organizations leveraging cloud services,
such as those offered by AWS, should be
aware of the relationships in their digital
ecosystem, regardless of whether those
relationships are services or networks.
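One way to stay aware of those relationships is to keep an explicit dependency map and ask which services sit downstream of a failed component. The sketch below assumes a hypothetical map; the service names and the `impacted_by` helper are invented for illustration and do not describe any real AWS customer architecture.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed items.
DEPENDS_ON = {
    "checkout-page": ["payments-api", "cdn"],
    "payments-api": ["cloud-region-a"],
    "reports": ["cloud-region-b"],
    "cdn": [],
}


def impacted_by(failed, depends_on):
    """Return every service that depends, directly or transitively, on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


impacted = impacted_by("cloud-region-a", DEPENDS_ON)
print(sorted(impacted))
```

In this made-up map, a failure in `cloud-region-a` reaches `checkout-page` only indirectly, through `payments-api`, which is the kind of relationship that is easy to miss without a map.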
Slack (8/2/23)
• Application outage that lasted from 16:01 to 18:00 UTC.
• Network paths and accessibility were unaffected.
• Initially, it appeared as HTTP 500 errors and higher-than-normal page load times.
• During the outage, users were unable to upload files or share screenshots.
• Root cause: work on a “routine database cluster migration” accidentally reduced database capacity to the point that it could not support a regularly scheduled job.
Square (9/8/23)
• Outage lasted over 18 hours.
• Backend issue that prevented
the platform from processing
payment transactions.
• Users reported various
problems, from terminal
connections dropping out, to
payments appearing to
complete but then not showing
up in business accounts.
• ThousandEyes observed
intermittent dropouts and 503
‘service unavailable’ errors.
Workday + Cloudflare (11/2/23)
• Cloudflare and Workday
experienced a major outage
due to multiple infrastructure
provider failures.
• DR resources took 6 hours to
come online and full resolution
took 36 hours.
• Initial cause was a partial mains
power outage at a Flexential
data center in Portland.
• Further generator and grid
failures resulted in a complete
power loss and ungraceful
shutdown.
Takeaways
• Understanding how your application works is important for quickly identifying failures and making
improvements.
• Just because your application is working doesn't mean it's functioning optimally.
• Knowing how all parts of the service work together is crucial for ongoing design and future optimizations.
• Improved visibility and operational optimizations can prevent outages and minimize their impact.
• Tracking different categories of outages and degradations over time can help teams identify recurring patterns.
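As a small illustration of the last takeaway, incidents can be tallied by category. The category labels below are a reading of the incident slides (BGP-related incidents as network, server-side errors as application, the data center power loss as infrastructure) and should be treated as assumptions. AWS and Square are left out because the slides do not state a category for them explicitly.

```python
from collections import Counter

# (name, date, category); categories are assumptions inferred from the slides.
incidents = [
    ("Microsoft 365", "2023-01-25", "network"),
    ("Microsoft Outlook", "2023-02-07", "application"),
    ("Virgin Media", "2023-04-04", "network"),
    ("Slack", "2023-08-02", "application"),
    ("Workday + Cloudflare", "2023-11-02", "infrastructure"),
]

by_category = Counter(category for _, _, category in incidents)

for category, count in sorted(by_category.items()):
    print(f"{category}: {count}")
```

The same tally grouped by month or quarter would show whether a given category is becoming more frequent over time.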
Next Steps
Blog and Podcast
• Subscribe to our blog to keep up-to-date: thousandeyes.com/blog/
• Tune in to The Internet Report podcast: https://www.thousandeyes.com/the-internet-report/
Learning Resources
• New tutorial videos on our features: thousandeyes.com/resources/?cat=tutorial
• New Getting Started Guides: docs.thousandeyes.com/product-documentation/getting-started
Support Community
• Still have questions? Ask us on the ThousandEyes Support Community AMA: http://bit.ly/2023Outages
Q&A

More Related Content

What's hot

Blockchain for IoT Security and Privacy: The Case Study of a Smart Home
Blockchain for IoT Security and Privacy: The Case Study of a Smart HomeBlockchain for IoT Security and Privacy: The Case Study of a Smart Home
Blockchain for IoT Security and Privacy: The Case Study of a Smart Home
Kishor Datta Gupta
 
13 DHCP Configuration in Linux
13 DHCP Configuration in Linux13 DHCP Configuration in Linux
13 DHCP Configuration in Linux
Hameda Hurmat
 
VPN (virtual Private Network)
VPN (virtual Private Network)VPN (virtual Private Network)
VPN (virtual Private Network)
Chandan Jha
 
Remote desktop connection
Remote desktop connectionRemote desktop connection
Remote desktop connection
Jasleen Kaur (Chandigarh University)
 

What's hot (20)

Blockchain for IoT Security and Privacy: The Case Study of a Smart Home
Blockchain for IoT Security and Privacy: The Case Study of a Smart HomeBlockchain for IoT Security and Privacy: The Case Study of a Smart Home
Blockchain for IoT Security and Privacy: The Case Study of a Smart Home
 
IPSec and VPN
IPSec and VPNIPSec and VPN
IPSec and VPN
 
Cloud Computing and Its Service Models
Cloud Computing and Its Service Models Cloud Computing and Its Service Models
Cloud Computing and Its Service Models
 
13 DHCP Configuration in Linux
13 DHCP Configuration in Linux13 DHCP Configuration in Linux
13 DHCP Configuration in Linux
 
Introduction to IoT Security
Introduction to IoT SecurityIntroduction to IoT Security
Introduction to IoT Security
 
Firewall ( Cyber Security)
Firewall ( Cyber Security)Firewall ( Cyber Security)
Firewall ( Cyber Security)
 
cloud virtualization technology
 cloud virtualization technology  cloud virtualization technology
cloud virtualization technology
 
Proxy Presentation
Proxy PresentationProxy Presentation
Proxy Presentation
 
VPN (virtual Private Network)
VPN (virtual Private Network)VPN (virtual Private Network)
VPN (virtual Private Network)
 
Grid Computing
Grid ComputingGrid Computing
Grid Computing
 
Kali linux.ppt
Kali linux.pptKali linux.ppt
Kali linux.ppt
 
Remote desktop connection
Remote desktop connectionRemote desktop connection
Remote desktop connection
 
Cloud Computing paradigm
Cloud Computing paradigmCloud Computing paradigm
Cloud Computing paradigm
 
Network devices
Network devicesNetwork devices
Network devices
 
Cloud computing and data security
Cloud computing and data securityCloud computing and data security
Cloud computing and data security
 
Features of a wireless network
Features of a wireless networkFeatures of a wireless network
Features of a wireless network
 
Open source operating systems
Open source operating systemsOpen source operating systems
Open source operating systems
 
Lecture 8 permissions
Lecture 8   permissionsLecture 8   permissions
Lecture 8 permissions
 
Vpn
VpnVpn
Vpn
 
Firewall
FirewallFirewall
Firewall
 

Similar to The Top Outages of 2023: Analysis and Takeaways

0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf
0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf
0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf
Saurabh Chauhan
 

Similar to The Top Outages of 2023: Analysis and Takeaways (20)

The Top Outages of 2023: Analyses and Takeaways
The Top Outages of 2023: Analyses and TakeawaysThe Top Outages of 2023: Analyses and Takeaways
The Top Outages of 2023: Analyses and Takeaways
 
The Top Outages of 2022: Analysis and Takeaways
The Top Outages of 2022: Analysis and TakeawaysThe Top Outages of 2022: Analysis and Takeaways
The Top Outages of 2022: Analysis and Takeaways
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 
EMEA.23.02.23_Top_Outages_of_2022_Webinar_Slides.pptx
EMEA.23.02.23_Top_Outages_of_2022_Webinar_Slides.pptxEMEA.23.02.23_Top_Outages_of_2022_Webinar_Slides.pptx
EMEA.23.02.23_Top_Outages_of_2022_Webinar_Slides.pptx
 
The Top Outages of 2022: Analysis and Takeaways
The Top Outages of 2022: Analysis and TakeawaysThe Top Outages of 2022: Analysis and Takeaways
The Top Outages of 2022: Analysis and Takeaways
 
Microsoft Outage Analysis
Microsoft Outage AnalysisMicrosoft Outage Analysis
Microsoft Outage Analysis
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 
What is ThousandEyes Webinar
What is ThousandEyes WebinarWhat is ThousandEyes Webinar
What is ThousandEyes Webinar
 
How to Evaluate, Rollout and Operationalize Your SD-WAN Projects
How to Evaluate, Rollout and Operationalize Your SD-WAN ProjectsHow to Evaluate, Rollout and Operationalize Your SD-WAN Projects
How to Evaluate, Rollout and Operationalize Your SD-WAN Projects
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
How to Evaluate, Rollout, and Operationalize Your SD-WAN Projects
How to Evaluate, Rollout, and Operationalize Your SD-WAN ProjectsHow to Evaluate, Rollout, and Operationalize Your SD-WAN Projects
How to Evaluate, Rollout, and Operationalize Your SD-WAN Projects
 
The Top Outages of 2021: Analysis and Takeaways
The Top Outages of 2021: Analysis and TakeawaysThe Top Outages of 2021: Analysis and Takeaways
The Top Outages of 2021: Analysis and Takeaways
 
Is Your Network Ready?
Is Your Network Ready?Is Your Network Ready?
Is Your Network Ready?
 
Visualizing Your Network Health - Driving Visibility in Increasingly Complex...
Visualizing Your Network Health -  Driving Visibility in Increasingly Complex...Visualizing Your Network Health -  Driving Visibility in Increasingly Complex...
Visualizing Your Network Health - Driving Visibility in Increasingly Complex...
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 
0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf
0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf
0328apjcintrotothousandeyeswebinar-230328233735-4df10d7f.pdf
 
Introduction To ThousandEyes
Introduction To ThousandEyesIntroduction To ThousandEyes
Introduction To ThousandEyes
 

More from ThousandEyes

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
ThousandEyes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
ThousandEyes
 

More from ThousandEyes (20)

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
New ThousandEyes Product Features and Release Highlights: March 2024
New ThousandEyes Product Features and Release Highlights: March 2024New ThousandEyes Product Features and Release Highlights: March 2024
New ThousandEyes Product Features and Release Highlights: March 2024
 
Assure Patient and Clinician Digital Experiences with ThousandEyes for Health...
Assure Patient and Clinician Digital Experiences with ThousandEyes for Health...Assure Patient and Clinician Digital Experiences with ThousandEyes for Health...
Assure Patient and Clinician Digital Experiences with ThousandEyes for Health...
 
AMER Introduction to ThousandEyes Webinar
AMER Introduction to ThousandEyes WebinarAMER Introduction to ThousandEyes Webinar
AMER Introduction to ThousandEyes Webinar
 
New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024
 
Enhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersEnhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for Partners
 
The Top Outages of 2023: Analysis and Takeaways
The Top Outages of 2023: Analysis and TakeawaysThe Top Outages of 2023: Analysis and Takeaways
The Top Outages of 2023: Analysis and Takeaways
 
ThousandEyes Enterprise Digital Workshop - Spanish
ThousandEyes Enterprise Digital Workshop - SpanishThousandEyes Enterprise Digital Workshop - Spanish
ThousandEyes Enterprise Digital Workshop - Spanish
 
ThousandEyes Enterprise Digital Workshop - German
ThousandEyes Enterprise Digital Workshop - GermanThousandEyes Enterprise Digital Workshop - German
ThousandEyes Enterprise Digital Workshop - German
 
ThousandEyes Enterprise Digital Workshop
ThousandEyes Enterprise Digital WorkshopThousandEyes Enterprise Digital Workshop
ThousandEyes Enterprise Digital Workshop
 
Introduction to ThousandEyes and Meraki MX for Partners
Introduction to ThousandEyes and Meraki MX for PartnersIntroduction to ThousandEyes and Meraki MX for Partners
Introduction to ThousandEyes and Meraki MX for Partners
 
Level-up Your Cloud Visibility Into AWS With ThousandEyes
Level-up Your Cloud Visibility Into AWS With ThousandEyesLevel-up Your Cloud Visibility Into AWS With ThousandEyes
Level-up Your Cloud Visibility Into AWS With ThousandEyes
 
Level-up Your Cloud Visibility Into AWS With ThousandEyes
Level-up Your Cloud Visibility Into AWS With ThousandEyesLevel-up Your Cloud Visibility Into AWS With ThousandEyes
Level-up Your Cloud Visibility Into AWS With ThousandEyes
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 

Recently uploaded (20)

A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 

The Top Outages of 2023: Analysis and Takeaways

  • 2. Featured speakers Brian Tobia Technical Marketing Engineer © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 3. 3 © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved. Before We Begin… • If you have any questions, please type them in the Questions window. • If you have any audio problems, please chat us for help. • A recording of this presentation will be sent to you in a few days. • Interested in more outage analysis and Internet insights? Check out the ThousandEyes blog and The Internet Report podcast.
  • 4. Anatomy of an Outage • Understanding different types of Internet outages is important to mitigate their impact. • Outages can vary in blast radius, be planned or unplanned, and have varying MTTR. • Network outages depend on where the problem occurs, with transit network incidents impacting multiple providers. • Tracking outages can help teams identify patterns and prevent customer service disruptions. © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 5. 5 © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved. Outage and Degradation Impacts BGP ISP CDN DNS SaaS Apps Services APIs Data Center Cloud DDoS Protection SSE RISK AND COMPLIANCE Is our traffic getting routed out of region? SERVICE AVAILABILITY Which cloud regions are impacted? SITUATIONAL AWARENESS Are regional ISPs spoofing our DNS records? SERVICE RECOVERY Did we successfully cut over to our DDoS mitigation service? NETWORK SECURITY Are SASE routing policies working as we expect? CUSTOMER SUPPORT Is an Internet outage preventing users from reaching our service? WORKFORCE PRODUCTIVITY Will our Salesforce dev updates degrade performance for some global users? $32,000 $120,000 $3,500 3474 REVENUE PROTECTION Is the payment gateway down or just unreachable?
  • 6. 2023 Outages by the Numbers: ISP Compared to CSP • ThousandEyes reported an increase in cloud service provider (CSP) outages in 2023. • CSP outages are the second most common type of disruption after ISP outages. • The ratio of CSP outages to ISP outages increased in 2023. © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 7. 2023 Outages by the Numbers: U.S.-centric Outages in Relation to Global Outages • U.S.-centric outages increased to 37% in 2023 from 34% in 2022. • Smaller, contained outages are becoming more common. • Localized outages have different impacts and require different responses compared to global outages. © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 8. 2023 Outages by the Numbers: Application Outages • The number and frequency of application outages have been on the rise over the past year. • Application-related disruptions can have a bigger impact than network outages, though they are not as common. © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 9. 9 Connections Are Complex Branch Office Employee BYOD Corp devices IOT Cameras and sensors IoT VDI People, places, and things Edge BYOD Data Center IOT Core network Mobile networks Core network Peering Access networks Wireless network Wireless gateway DNS Cloud and SaaS Cloud providers Datacenter infrastructure Cloud connectivity Direct connect ISP transit providers SaaS onramp © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 10. 10 Correlate Performance Across Every Layer 8 3 9 3 5 4 6 8 6 Time Correlated © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 11. Microsoft (1/25) Outlook (2/7) Virgin Media (4/4) AWS (6/13) Slack (8/2) Square (9/8) Workday + Cloudflare (11/2) 2023 Outage Timeline Purple = Application Outage Red = Network Outage Blue = Infrastructure Outage Bookmark the Internet Outages Timeline for outage updates throughout the year. © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 12. Slack (Aug 2, 2023) ~2 hours Unable to send/receive messages AWS (Jun 13, 2023) ~2 hours Latency, server timeouts, and HTTP errors Virgin Media (Apr 4, 2023) ~7 hours Network traffic loss/BGP route withdrawal Microsoft Outlook (Feb 7, 2023) ~2 hours Service unavailable/application errors Microsoft 365 (Jan 25, 2023) ~90 minutes Network issues due to BGP changes #2 #3 #1 #4 #5 Square (Sept 8, 2023) ~12 hours App errors and backend transactions failing #6 Workday + Cloudflare (Nov 2, 2023) ~36 hours Application and service outages #7
  • 13. Microsoft 365 (1/25/23) • Microsoft started experiencing service related issues around 07:05 AM (UTC). • The disruption was triggered by an external BGP change by Microsoft that impacted connected service providers • Microsoft BGP prefixes were withdrawn completely but then almost immediately re-advertised. • Affected smaller (/24) prefixes and summary prefixes (/12). • Cascading impact on global routing tables, causing significant churn. • Prefixes were either withdrawn or re-advertised to transit providers. • Large amount of packet loss were seen as well as HTTP and DNS timeouts. • Timeouts seen in the application “Response,” further indicating the effect of the network on service availability. © 2024 Cisco Systems, Inc. and/or its affiliates. All rights reserved.
  • 14. Microsoft Outlook (2/7/23) • Starting around 03:55 UTC, Outlook became unavailable. • Network path was working properly, but ThousandEyes observed elevated server response timeouts and slow page loading. • Majority of the errors were HTTP server timeouts, indicating an application issue. • Incident was mostly concentrated in the U.S. and lasted ~2 hours.
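The reasoning on the two Microsoft slides (packet loss points to the network, while a clean network path with HTTP timeouts points to the application) can be sketched as a simple triage rule. This is an illustrative heuristic only: the function name `classify_incident` and both thresholds are assumptions, not ThousandEyes logic or values from the presentation.

```python
def classify_incident(packet_loss_pct, http_timeout_rate,
                      loss_threshold=5.0, timeout_threshold=0.2):
    """Rough triage from two signals.

    packet_loss_pct:   percent of probes lost on the network path (0-100)
    http_timeout_rate: fraction of HTTP requests that timed out (0.0-1.0)
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if packet_loss_pct >= loss_threshold:
        # Loss on the path: application errors are likely a downstream symptom.
        return "network"
    if http_timeout_rate >= timeout_threshold:
        # Path is clean but the server is not answering in time.
        return "application"
    return "healthy"


# Clean path with many HTTP timeouts, similar in shape to the Outlook incident.
outlook_like = classify_incident(packet_loss_pct=0.5, http_timeout_rate=0.6)

# Heavy loss plus timeouts, similar in shape to the Microsoft 365 incident.
m365_like = classify_incident(packet_loss_pct=40.0, http_timeout_rate=0.7)
```

Checking the network signal first reflects the point made on the Microsoft 365 slide: when the path is losing packets, application timeouts do not by themselves indicate an application fault.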
  • 15. Virgin Media (4/4/23) • From approximately 00:30 to 17:30 UTC, two outages impacted the reachability of Virgin Media UK’s network and services. • The first incident began at approximately 00:30 UTC and appeared to coincide with a series of BGP route withdrawals. • The second incident was shorter, but the network experienced similar BGP and reachability issues. • The outages began overnight and, given their repeated nature, could indicate maintenance issues.
  • 16. AWS (6/13/23) • Outage impacted services within US-EAST-1. • It lasted two hours; increased latency, server timeouts, and HTTP server errors were observed. • AWS console access was also affected, making troubleshooting difficult. • AWS confirmed the issue was due to a capacity management subsystem failure. • Organizations leveraging cloud services, such as those offered by AWS, should be aware of the relationships in their digital ecosystem, regardless of whether those relationships are services or networks.
  • 17. Slack (8/2/23) • Application outage that lasted from 4:01 PM to 6 PM (UTC). • Network paths and accessibility were unaffected. • It first appeared as HTTP 500 errors and higher-than-normal page load times. • During the outage, users were unable to upload files or share screenshots. • Root cause: work on a “routine database cluster migration” accidentally reduced database capacity to the point that it could not support a regularly scheduled job.
  • 18. Square (9/8/23) • Outage lasted over 18 hours. • Backend issue that prevented the platform from processing payment transactions. • Users reported various problems, from terminal connections dropping out, to payments appearing to complete but then not showing up in business accounts. • ThousandEyes observed intermittent dropouts and 503 ‘service unavailable’ errors.
  • 19. Workday + Cloudflare (11/2/23) • Cloudflare and Workday experienced a major outage due to multiple infrastructure provider failures. • DR resources took 6 hours to come online and full resolution took 36 hours. • Initial cause was a partial mains power outage at a Flexential data center in Portland. • Further generator and grid failures resulted in a complete power loss and ungraceful shutdown.
  • 20. Takeaways • Understanding how your application works is important for quickly identifying failures and making improvements. • Just because your application is working doesn't mean it's functioning optimally. • Knowing how all parts of the service work together is crucial for ongoing design and future optimizations. • Improved visibility and operational optimizations can prevent outages and minimize their impact. • Tracking different categories of outages and degradations over time can be helpful.
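As one way to act on the last takeaway, the sketch below tallies the seven 2023 incidents by category. Durations are the approximate figures from the ranked summary slide. The category labels follow the slide descriptions where they are explicit; the labels for AWS and Square are assumptions made for this example, as are the names `OUTAGES` and `summarize`.

```python
from collections import defaultdict

# (name, category, approximate duration in minutes)
OUTAGES = [
    ("Microsoft 365", "network", 90),
    ("Microsoft Outlook", "application", 120),
    ("Virgin Media", "network", 420),
    ("AWS", "infrastructure", 120),        # category is an assumption
    ("Slack", "application", 120),
    ("Square", "application", 720),        # category is an assumption
    ("Workday + Cloudflare", "infrastructure", 2160),
]


def summarize(outages):
    """Count incidents and total approximate minutes per category."""
    summary = defaultdict(lambda: {"count": 0, "minutes": 0})
    for _name, category, minutes in outages:
        summary[category]["count"] += 1
        summary[category]["minutes"] += minutes
    return dict(summary)


by_category = summarize(OUTAGES)
for category, stats in sorted(by_category.items()):
    print(f"{category}: {stats['count']} incidents, ~{stats['minutes']} min")
```

Running the same tally on a team's own incident log each quarter would show whether one category is growing in frequency or duration.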
  • 21. Next Steps • Blog and Podcast: Subscribe to our blog to keep up-to-date! thousandeyes.com/blog/ Tune in to The Internet Report Podcast. https://www.thousandeyes.com/the-internet-report/ • Learning Resources: New tutorial videos on our features thousandeyes.com/resources/?cat=tutorial New Getting Started Guides docs.thousandeyes.com/product-documentation/getting-started • Support Community: Still have questions? Ask us on the ThousandEyes Support Community AMA: http://bit.ly/2023Outages
  • 22. Q&A