3. Before We Begin...
• If you have any questions, please type them in the Questions window.
• If you have any audio problems, please reach out via chat for help.
• A recording of this presentation will be sent to you in a few days.
5. Actionable Insight for Internet, Cloud, and SaaS
• Correlated Insights: quickly isolate issues to app, network, or service
• Network Visibility: overlay, hop-by-hop underlay, ISP performance, and BGP routing
• App Experience: SaaS, API, and internal app performance and user experience
6. See the Internet Like It’s Your Own Network
[Diagram: path from Your Network through your ISP to the Cloud Provider, with vantage points in Chicago, IL; Paris, France; and Moscow, Russia]
• Visualize the link between network topologies and service delivery
• Rapidly isolate problem domain and owner
8. ThousandEyes Internet Insights: App Outages
[Graphic: top business SaaS apps by category – Dev Tools, Communication Tools, Human Resources, Social Networking, Finance, eCommerce, Sales & Marketing, Collaboration Tools]
• Global View of SaaS App Availability
• Accelerated & Empowered IT Operations
• Data-driven Vendor Governance
10. Amazon Web Services – At a Glance
• Availability Zones (see the sketch after this list)
• Key Components
– EC2 – compute
– S3 – storage
– API Gateway
• Ecosystem
– 200+ services
• US-EAST-1 outsized interdependency
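A minimal sketch of how a region's Availability Zones can be enumerated, assuming boto3 is installed and AWS credentials are configured locally; us-east-1 is used purely as an example region.

```python
import boto3

# Assumes boto3 is installed and AWS credentials are configured locally.
# Lists the Availability Zones that make up one region (us-east-1 here),
# the region where much of the 12/7 impact was concentrated.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)

for zone in response["AvailabilityZones"]:
    print(zone["ZoneName"], zone["State"])
```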
11. Application Programming Interface (API)
• Enables communication between disparate applications/systems (see the sketch below)
• Increased application complexity
• Interdependencies and domino effects
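As a concrete illustration of one application depending on another's API, here is a minimal Python sketch; the endpoint URL is hypothetical, and a real front end typically fans out to many such calls, which is how one failing API produces domino effects.

```python
import requests

# Hypothetical backend API endpoint -- a placeholder, not a real service.
INVENTORY_API = "https://api.example.com/v1/inventory/12345"

def get_inventory():
    # Any feature built on this call inherits its failures: if the API
    # times out or errors, the feature degrades along with it.
    try:
        resp = requests.get(INVENTORY_API, timeout=3)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException as exc:
        # This is where domino effects start: the caller must decide
        # whether to retry, serve stale data, or fail the page.
        print(f"inventory API unavailable: {exc}")
        return None
```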
12. Amazon API Gateway
• Gatekeeper for backend APIs in AWS (example call below)
• Capable of processing hundreds of thousands of concurrent API calls
• AWS offers its own internal services to customers via API Gateway
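A hedged sketch of calling an API fronted by Amazon API Gateway. The invoke URL below is hypothetical (the real format is https://{api-id}.execute-api.{region}.amazonaws.com/{stage}/...), and the retry-with-backoff loop is one common way callers cope with throttling and elevated error rates like those seen during the incident, not an AWS-prescribed pattern.

```python
import time
import requests

# Hypothetical API Gateway invoke URL -- replace with a real stage URL.
URL = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/orders"

def call_with_backoff(url, attempts=4):
    delay = 0.5
    for _ in range(attempts):
        try:
            resp = requests.get(url, timeout=5)
            # 429s and 5xx responses are what callers saw more of during
            # the outage window; anything else is returned to the caller.
            if resp.status_code < 500 and resp.status_code != 429:
                return resp
        except requests.RequestException:
            pass  # network-level failure: fall through and retry
        time.sleep(delay)
        delay *= 2  # exponential backoff between retries
    return None
```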
14. 12/7 – Event Sequence as Observed by ThousandEyes
• 1532 UTC – Outage begins
• 1535 UTC – Server response failures
• 1640 UTC – AWS status page: first mention
• 1712 UTC – AWS API transaction times increase
• 0100 UTC – Return to normal
15. 12/7 – Event Sequence from Amazon RCA
• 1530 UTC – Multiple services impacted due to congestion from automated activity
• 1533 UTC – EC2 API errors and increased latency
• 1728 UTC – Internal DNS remediation, issues still persist
• Ongoing network congestion remediation measures
• 2134 UTC – Significant alleviation of network congestion
• 2135 UTC – Container API begins to return to normal
• 2222 UTC – Network devices and AWS Console access "all clear"
• 2230 UTC – Route 53 APIs "all clear"
• 2240 UTC – EC2 "all clear"
• 0041 UTC – API Gateways recovered
16. 12/10 – Event Sequence as Observed by ThousandEyes
• 1305 UTC – Outage begins
• Server response failures
• Brief clear, followed by resumption
• 1430 UTC – Return to normal
18. Lessons and Takeaways
• Understand your network and application interdependencies
– Front-end interfaces often depend on many back-end APIs
• How does your cloud provider work?
– Understand architecture and interdependencies
– Single AZ, multi-AZ, multi-cloud
– AWS ≠ Azure ≠ GCP
• Inform your Incident Response / Outage Management
– Specific guidance when issues take place
– Example: we’re seeing 2x API response times and it is impacting x, y, z across all zones
• Independent visibility and verification are needed (see the sketch below)
– Don’t just depend on the status page!
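One lightweight way to get independent verification, separate from the provider's status page, is to probe the service yourself. The sketch below times a request against an API endpoint and flags when responses exceed roughly 2x a known baseline; the endpoint, baseline, and 2x threshold are illustrative assumptions, and a monitoring platform would run such probes from many vantage points rather than one host.

```python
import time
import requests

# Illustrative values -- substitute your own endpoint and a measured baseline.
ENDPOINT = "https://api.example.com/healthz"
BASELINE_SECONDS = 0.25            # typical response time under normal conditions
THRESHOLD = 2 * BASELINE_SECONDS   # flag responses taking ~2x longer than usual

def probe(url):
    # Measure wall-clock time for a single request; report failures as None.
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=10)
        return resp.status_code, time.monotonic() - start
    except requests.RequestException:
        return None, time.monotonic() - start

if __name__ == "__main__":
    status, elapsed = probe(ENDPOINT)
    if status is None or status >= 500:
        print(f"FAIL: no healthy response after {elapsed:.2f}s")
    elif elapsed > THRESHOLD:
        print(f"WARN: {elapsed:.2f}s response (~{elapsed / BASELINE_SECONDS:.1f}x baseline)")
    else:
        print(f"OK: {status} in {elapsed:.2f}s")
```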
19. Next Steps
• Subscribe! https://blog.thousandeyes.com
• Get a real-time view of the health of the Internet: https://thousandeyes.com/outages
• Sign up for a Free Trial: https://www.thousandeyes.com/signup
• Request a demo: https://www.thousandeyes.com/request-demo