Presented in 2013
Microsoft Confidential – Internal Use Only
Show Me the Model!
SCALE
Lifecycle Modeling
• Business focused
• Identify the expected lifecycle of each workload
  • Across the year by month
  • Across the week by day
  • Across the day by hour
• Special periods as appropriate
  • Holidays
  • Game Days vs. Non-Game Days
• Don’t start with 9s – start with looser terms
  • None, Low, Medium, High, Very High, Highest
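A lifecycle model of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's tooling: the workload (a sports ticketing site), every table value, the `Need` and `need_at` names, and the rule of taking the strictest applicable level are all assumptions made for the example.

```python
from datetime import datetime
from enum import IntEnum


class Need(IntEnum):
    """Loose availability terms from the slide, least to most critical."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4
    HIGHEST = 5


# Hypothetical workload: every value below is illustrative only.
BY_MONTH = {month: Need.LOW for month in range(1, 13)}
BY_MONTH.update({9: Need.HIGH, 10: Need.HIGH, 11: Need.HIGH})  # assumed season

BY_WEEKDAY = {day: Need.MEDIUM for day in range(7)}  # Monday == 0
BY_WEEKDAY.update({5: Need.HIGH, 6: Need.HIGH})      # assumed weekend peak

BY_HOUR = {hour: Need.LOW for hour in range(24)}
BY_HOUR.update({hour: Need.HIGH for hour in range(17, 23)})  # assumed evening peak

SPECIAL_DAYS = {(11, 28): Need.HIGHEST}  # assumed game day, keyed by (month, day)


def need_at(moment: datetime) -> Need:
    """Return the strictest need that applies at the given moment.

    Combining the views with max() is an assumption of this sketch.
    """
    special = SPECIAL_DAYS.get((moment.month, moment.day), Need.NONE)
    return max(
        BY_MONTH[moment.month],
        BY_WEEKDAY[moment.weekday()],
        BY_HOUR[moment.hour],
        special,
    )
```

Calling `need_at(datetime(2013, 11, 28, 3))` returns `Need.HIGHEST` because the assumed game day overrides the quiet early-morning hour.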
Availability Modeling
• Translates business lifecycle to 9s
  • What level of uptime must be achieved to meet the needs of the lifecycle model
• Helps rationalize desired SLA vs. ability and cost to deliver
• Guides architecture and technical decisions
  • Guides capacity planning
  • Guides selection of 3rd party services
  • Prepares for prioritized resiliency planning
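To make "9s" concrete, the arithmetic below converts an availability target into the downtime it permits. It is a small sketch; the function name `allowed_downtime_minutes` and the 365-day year are assumptions of the example.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

Three nines (99.9%) leaves roughly 525.6 minutes per year, while four nines (99.99%) leaves roughly 52.6. Putting those budgets next to the cost of delivering them is one way to rationalize a desired SLA.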
Scale Units
• Scale out vs. Scale Up
• Define units of scale
  • Benefits testing
  • Can be used with automated scale up/down
  • Helps with cost modeling
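One way to see how a defined scale unit helps with capacity and cost modeling is the sketch below. The `ScaleUnit` fields, the 20% headroom default, and the sample figures are hypothetical. The premise is that a single unit's capacity has been measured by testing that unit in isolation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """A group of resources that is deployed and scaled together (hypothetical fields)."""
    name: str
    max_requests_per_second: int  # capacity measured by load-testing one unit
    monthly_cost: float


def units_needed(unit: ScaleUnit, peak_rps: float, headroom: float = 0.2) -> int:
    """Whole units required to serve the peak load plus a safety margin."""
    return math.ceil(peak_rps * (1 + headroom) / unit.max_requests_per_second)


def monthly_cost(unit: ScaleUnit, peak_rps: float, headroom: float = 0.2) -> float:
    """Cost of running enough units for the given peak."""
    return units_needed(unit, peak_rps, headroom) * unit.monthly_cost


# Illustrative numbers only.
web_unit = ScaleUnit("web + cache", max_requests_per_second=500, monthly_cost=1200.0)
print(units_needed(web_unit, peak_rps=2000), monthly_cost(web_unit, peak_rps=2000))
```

The same `units_needed` calculation could feed an automated scale up/down rule, with the peak replaced by a live load measurement.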
RMA (Resilience Modeling and Analysis)
• Identify failure points
  • Component interactions
• Identify failure modes
  • Discovery
  • Incorrectness
  • Auth
  • Limits/Latency
• Assess risk priority
  • Impact and likelihood
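The risk-priority step can be sketched as a simple ranking. The slide names only impact and likelihood as inputs, so the 1 to 5 scales, the multiplicative score, and the sample scores below are assumptions of this example, not the method's official formula.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    interaction: str  # component interaction where the failure occurs
    category: str     # Discovery, Incorrectness, Auth, or Limits/Latency
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int   # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        """Assumed scoring rule: impact multiplied by likelihood."""
        return self.impact * self.likelihood


def prioritize(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Highest-risk failure modes first."""
    return sorted(modes, key=lambda mode: mode.risk, reverse=True)


# Hypothetical scores for illustration.
modes = [
    FailureMode("Storage Layer -> Azure Storage", "Limits/Latency", impact=4, likelihood=3),
    FailureMode("Client API -> OrgID RPS", "Auth", impact=2, likelihood=2),
    FailureMode("Midwest Datacenter", "Discovery", impact=5, likelihood=1),
]
for mode in prioritize(modes):
    print(mode.risk, mode.interaction)
```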
Pre-work -> Discover -> Rate -> Act
Pre-work
Discover
Discovery
• Caller cannot locate the resource due to configuration errors
  o Configuration source is incorrect
  o Configuration source is missing
  o Configuration source is corrupt
  o Network configuration prevents connection (e.g. ACL, firewall)
• Caller cannot locate the resource due to name resolution errors
  o Name resolution service is not responsive
  o Name resolution service has a missing resource record
  o Name resolution service has a stale or corrupt resource record
Incorrectness
• Caller receives an error because the request is incorrect
  o Protocol violation (e.g. invalid parameters passed by caller, 400 Bad Request)
  o Version mismatch (e.g. N-1 or greater not supported, 405 Method Not Allowed)
• Request does not complete due to corrupt or malformed data
  o Read/Write failed due to resource corruption (e.g. disk, file, db, table)
  o Data returned to caller is not what was expected (e.g. incorrect record entry)
  o Poison message prevents resource or caller from processing
• Caller receives an error because of bad assumed context
  o Resource is in an invalid state to complete the request (e.g. deleting a directory that is not empty, starting something already started)
  o Non-idempotent transaction errors (e.g. resource already exists)
  o Resource is missing (e.g. 404 Not Found; missing database, table, row, or file)
  o Timing is incorrect (e.g. events happen in the wrong order)
Auth
• Caller receives an authentication error
  o Authentication service unavailable
  o Account doesn't exist, account expired, or password incorrect
  o Certificate incorrect or expired
• Caller receives an authorization failure
  o Access denied to resource (e.g. 403 Forbidden)
Limits/Latency
• Caller receives no response from the resource, resulting in a timeout or blocking on the caller
  o Time out even after successful connect (e.g. process deadlocked)
  o Time out errors because of resource load (e.g. out of storage, memory, processing)
  o Time out errors due to network (e.g. capacity or latency)
  o Requests simply dropped by network or resource
• Caller receives an error related to exceeding limits on the resource
  o Unspecified errors (e.g. 500 Internal Server Error)
  o Metering on resource (e.g. 503 Service Unavailable or 429 Too Many Requests)
  o Resource exhaustion (e.g. insufficient storage, memory, processing, or queue length)
  o Sharing contention (e.g. sharing of resource with other services, components, or maintenance activities)
  o Unbounded, unconstrained requests or responses (e.g. expected one row but returned one million rows)
  o Request flooding (e.g. DDoS, malicious or self-inflicted)
• Caller receives a success response but at a very slow rate, causing queue lengths on the caller to exceed their limits
  o Heavy loads on resource can cause slow response times
  o Network congestion or latency
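The four categories above can be applied mechanically when triaging a failed call. The sketch below maps only the HTTP status codes named in the lists to their category. The `categorize` function, its parameters, and the rule that a name resolution failure or a timeout takes precedence over a status code are assumptions of this example.

```python
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    DISCOVERY = "Discovery"
    INCORRECTNESS = "Incorrectness"
    AUTH = "Auth"
    LIMITS_LATENCY = "Limits/Latency"


# Only the status codes that the lists above mention explicitly.
STATUS_TO_CATEGORY = {
    400: FailureCategory.INCORRECTNESS,   # protocol violation
    404: FailureCategory.INCORRECTNESS,   # resource is missing
    405: FailureCategory.INCORRECTNESS,   # version mismatch
    403: FailureCategory.AUTH,            # access denied
    429: FailureCategory.LIMITS_LATENCY,  # metering on resource
    500: FailureCategory.LIMITS_LATENCY,  # unspecified error
    503: FailureCategory.LIMITS_LATENCY,  # metering on resource
}


def categorize(status: Optional[int] = None,
               timed_out: bool = False,
               name_resolution_failed: bool = False) -> Optional[FailureCategory]:
    """Classify one failed call; returns None when the sketch has no rule for it."""
    if name_resolution_failed:
        return FailureCategory.DISCOVERY
    if timed_out:
        return FailureCategory.LIMITS_LATENCY
    if status is None:
        return None
    return STATUS_TO_CATEGORY.get(status)
```

Returning `None` for anything unlisted is deliberate: the sketch does not guess categories for codes the lists do not mention.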
Discover
ID: 3
Component / Dependency Interaction: Storage Layer -> Azure Storage
Failure Short Name: Error 5xx from Azure Storage::ServerAPI
Failure Description: Azure Storage may respond with ServerBusy or OperationTimedOut when the web role is attempting to read/write data in one of the Storage tables on behalf of a service client.
Response: Return Error to caller. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 2A
Component / Dependency Interaction: Client API -> Relying Party Suite
Failure Short Name: Data Encryption Key certificate invalid
Failure Description: The RPS component may have an invalid/expired DEK certificate.
Response: Return Authentication error to caller. The only users who will see effects of this are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently. Recovery requires human intervention.

ID: 2B
Component / Dependency Interaction: Client API -> OrgID RPS
Failure Short Name: No Response from OrgID RPS
Failure Description: The ClientAPI may not receive a response from the OrgID RPS worker role. Due to the limited number of instances of the OrgID RPS role, there may be a combination of events that take down one instance in a Fault Domain and another in an Update Domain concurrently. This may be coupled with a capacity issue on the remaining instance.
Response: Return Authentication error to caller. The only users who will see effects of this are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently.

ID: 4
Component / Dependency Interaction: Queue Integration Service -> Queue Service
Failure Short Name: No Response from Queue Service
Failure Description: The Queue service may be unresponsive for an extended period of time.
Response: Buffer locally the first 50 requests for later playback. Discard requests after the buffer reaches 50. Less than 1% of users are both CloudStore and ProductivityClient users, which are the only ones that would see impact. The Observer monitors in real time the attempts it makes and the responses via perf counters. The Observer, called by the framework, uses those counters to decide if it's in a healthy state.

ID: 4
Component / Dependency Interaction: Queue Integration Service -> Queue Service
Failure Short Name: Invalid Client Certificate
Failure Description: The Client certificate on the role instance for use with the Queue Integration service may be invalid/expired, or the Queue service may make a breaking change which invalidates the Client certificate.
Response: All calls to Queue service would fail. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 1B
Component / Dependency Interaction: Web Service -> Server API
Failure Short Name: Invalid SSL Certificate
Failure Description: The SSL certificate for IIS may be invalid/expired when servicing service clients.
Response: Return SSL error to caller. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 1B
Component / Dependency Interaction: Web Service -> Server API
Failure Short Name: Latency from Server API
Failure Description: The Server API may be slow to respond to calls originating from outside the USA because the web service infrastructure's only location is the Midwest.
Response: Caller will time out, resulting in, for example, blank functionality. 6-10% of calls from Southeast Asia consistently fail. Overall user base affected would be <2%. Monitoring has not caught this problem since it's too transient.

ID: 7
Component / Dependency Interaction: Azure Software Load Balancer -> Web Roles
Failure Short Name: Azure SLB cannot talk to Web Roles::ServerAPI
Failure Description: The Azure SLB may be unable to communicate with any of the Web Role instances for service clients.
Response: Error 404 returned to the caller. Probes will detect this error. This has not been seen in production yet. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 8
Component / Dependency Interaction: Azure DNS
Failure Short Name: Azure DNS Failure::ServerAPI
Failure Description: The Azure DNS system may fail, resulting in the inability of service clients to resolve the DNS name of the Contoso service.
Response: Error DNS not found returned to the caller. Probes will detect. This has not yet been seen in production. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 9
Component / Dependency Interaction: Midwest Datacenter
Failure Short Name: Midwest Datacenter Outage
Failure Description: Contoso online service may be completely offline due to an outage of the Midwest Datacenter.
Response: Contoso will be offline until the Midwest Datacenter service is restored. There is no failover datacenter for Contoso. Outside-in type testing would detect the service failure.
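The response recorded for the unresponsive Queue service (buffer the first 50 requests locally, discard the rest, play back later) can be sketched as a bounded buffer. The class and method names below are hypothetical, and counting discards is an addition made here so that monitoring could see the loss. This is an illustration of the stated policy, not the service's actual code.

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

T = TypeVar("T")


class PlaybackBuffer(Generic[T]):
    """Holds up to `capacity` requests while a dependency is unresponsive."""

    def __init__(self, capacity: int = 50) -> None:
        self._capacity = capacity
        self._items: Deque[T] = deque()
        self.discarded = 0  # count of requests dropped because the buffer was full

    def offer(self, request: T) -> bool:
        """Buffer the request, or discard it once the buffer is full."""
        if len(self._items) >= self._capacity:
            self.discarded += 1
            return False
        self._items.append(request)
        return True

    def play_back(self, send: Callable[[T], None]) -> int:
        """Replay buffered requests in arrival order; returns how many were sent.

        A request is removed only after `send` returns, so if `send` raises,
        that request stays at the front of the buffer for the next attempt.
        """
        sent = 0
        while self._items:
            send(self._items[0])
            self._items.popleft()
            sent += 1
        return sent
```

A caller would invoke `offer` for each request while the Queue service is down and call `play_back` with the real send function once health checks report recovery.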
Rate
Act
Fail safe modeling for cloud services and applications
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 

Fail safe modeling for cloud services and applications

  • 30. Lifecycle Modeling
      • Business focused
      • Identify the expected lifecycle of each workload
          • Across the year by month
          • Across the week by day
          • Across the day by hour
      • Special periods as appropriate
          • Holidays
          • Game Days vs. Non-Game Days
      • Don’t start with 9s – start with looser terms
          • None, Low, Medium, High, Very High, Highest
  • 32. Availability Modeling
      • Translates business lifecycle to 9s
          • What level of uptime must be achieved to meet the needs of the lifecycle model
      • Helps rationalize desired SLA vs. ability and cost to deliver
      • Guides architecture and technical decisions
          • Guides capacity planning
          • Guides selection of 3rd party services
      • Prepares for prioritized resiliency planning
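
As a rough illustration of what "translating to 9s" means in practice, the sketch below converts an availability target into a downtime budget. The function name `allowed_downtime_minutes` and the 365-day year are assumptions made for this example; they do not come from the slides.

```python
# Illustrative sketch: convert an availability target ("nines") into a downtime budget.
# The function name and the default 365-day period are assumptions for this example.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted over period_days at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per 365-day year")
```

Putting the budget in minutes next to the lifecycle model makes it easier to judge whether a desired SLA is worth its cost during, say, a "Low" period versus a "Highest" one.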
  • 34. Scale Units
      • Scale out vs. Scale Up
      • Define units of scale
      • Benefits testing
      • Can be used with automated scale up/down
      • Helps with cost modeling
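
One way to see how a defined scale unit helps with cost modeling is sketched below. The `ScaleUnit` type, its fields, the 20% headroom default, and all numbers are hypothetical; the slides do not prescribe a formula.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """A hypothetical unit of scale: a fixed bundle of resources deployed together."""
    name: str
    requests_per_second: float  # capacity of one unit, assumed to come from load testing
    monthly_cost: float         # assumed cost of running one unit for a month


def units_needed(peak_rps: float, unit: ScaleUnit, headroom: float = 0.2) -> int:
    """Number of units required to serve peak_rps with the given fractional headroom."""
    if peak_rps < 0 or headroom < 0:
        raise ValueError("peak_rps and headroom must be non-negative")
    required_rps = peak_rps * (1.0 + headroom)
    # Always keep at least one unit deployed, even when the expected load is zero.
    return max(1, math.ceil(required_rps / unit.requests_per_second))


if __name__ == "__main__":
    web_unit = ScaleUnit(name="web", requests_per_second=500, monthly_cost=1000)
    count = units_needed(peak_rps=1800, unit=web_unit)
    print(f"{count} units, about {count * web_unit.monthly_cost:.0f} per month")
```

Because the unit is tested as a whole, its measured capacity can feed both automated scale up/down rules and a per-period cost estimate.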
  • 36. RMA (Resilience Modeling and Analysis)
      • Identify failure points
          • Component interactions
      • Identify failure modes
          • Discovery
          • Incorrectness
          • Auth
          • Limits/Latency
      • Assess risk priority
          • Impact and likelihood
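
The "assess risk priority" step can be sketched as a simple ranking. The slide names only impact and likelihood; the 1 to 5 scales, the product used as the score, and the example entries below are assumptions for illustration, not the scoring method of the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    interaction: str  # the component interaction where the failure can occur
    short_name: str
    impact: int       # assumed scale: 1 (negligible) to 5 (severe)
    likelihood: int   # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        # Assumed scoring rule: risk is the product of impact and likelihood.
        return self.impact * self.likelihood


def rank_by_risk(modes):
    """Return the failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda mode: mode.risk, reverse=True)


if __name__ == "__main__":
    # Placeholder scores, chosen only to demonstrate the ranking.
    modes = [
        FailureMode("Web tier -> Storage", "Storage returns 5xx", impact=4, likelihood=3),
        FailureMode("Client -> DNS", "Name resolution fails", impact=5, likelihood=1),
        FailureMode("API -> Auth service", "Certificate expired", impact=3, likelihood=2),
    ]
    for mode in rank_by_risk(modes):
        print(f"{mode.risk:>2}  {mode.interaction}: {mode.short_name}")
```

The ranked list gives a starting order for resiliency work; a real analysis would likely weigh more factors than these two.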
  • 39. Discover

      Discovery
      • Caller cannot locate the resource due to configuration errors
          o Configuration source is incorrect
          o Configuration source is missing
          o Configuration source is corrupt
          o Network configuration prevents connection (e.g. ACL, firewall)
      • Caller cannot locate the resource due to name resolution errors
          o Name resolution service is not responsive
          o Name resolution service has a missing resource record
          o Name resolution has a stale or corrupt resource record

      Incorrectness
      • Caller receives an error because the request is incorrect
          o Protocol violation (e.g. invalid parameters passed by caller, 400 Bad Request)
          o Version mismatch (e.g. N-1 or greater not supported, 405 Method Not Allowed)
      • Request does not complete due to corrupt or malformed data
          o Read/Write failed due to resource corruption (e.g. disk, file, db, table, etc.)
          o Data returned to caller is not what was expected (e.g. incorrect record entry)
          o Poison message prevents resource or caller from processing
      • Caller receives an error because of bad assumed context
          o Resource is in invalid state to complete request (e.g. del DIR that is not empty, start something already started)
          o Non-idempotent transaction errors (e.g. resource already exists)
          o Resource is missing (404 Not Found, Database, Table, Row, File, etc.)
          o Timing is incorrect (e.g. events happen in the wrong order)

      Auth
      • Caller receives an authentication error
          o Authentication service unavailable
          o Account doesn't exist, account expired, password incorrect
          o Certificate incorrect or expired
      • Caller receives an authorization failure
          o Access denied to resource (e.g. 403 Forbidden)

      Limits/Latency
      • Caller receives no response from the resource, resulting in timeout or blocking on caller
          o Time out even after successful connect (e.g. process deadlocked)
          o Time out errors because of resource load (e.g. out of storage, memory, processing)
          o Time out errors due to network (e.g. capacity or latency)
          o Requests simply dropped by network or resource
      • Caller receives an error related to exceeding limits on the resource
          o Unspecified errors (e.g. 500 Internal Server Error)
          o Metering on resource (e.g. 503 Service Unavailable or 429 Too Many Requests)
          o Resource exhaustion (e.g. insufficient storage, memory, processing, or queue length)
          o Sharing contention (e.g. sharing of resource with other services, components, or maintenance activities)
          o Unbounded, unconstrained requests or responses (e.g. expected one row but returned one million rows)
          o Request flooding (e.g. DDoS, malicious or self-inflicted)
      • Caller receives a success response but at a very slow rate, causing queue lengths on the caller to exceed limits
          o Heavy loads on resource can cause slow response times
          o Network congestion or latency
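
The HTTP status codes that the slide gives as examples can be collected into a small lookup, which may help when tagging observed errors with a failure category during discovery. The mapping contains only the codes named on the slide; the names `FAILURE_CATEGORY_BY_STATUS` and `classify_status` and the "Unclassified" fallback are assumptions for this sketch.

```python
# Status codes and categories as given in the slide's examples; nothing else is mapped.
FAILURE_CATEGORY_BY_STATUS = {
    400: "Incorrectness",   # protocol violation, e.g. invalid parameters
    403: "Auth",            # authorization failure, access denied
    404: "Incorrectness",   # bad assumed context: resource is missing
    405: "Incorrectness",   # listed on the slide as a version mismatch example
    429: "Limits/Latency",  # metering on resource
    500: "Limits/Latency",  # listed on the slide under unspecified errors
    503: "Limits/Latency",  # metering on resource
}


def classify_status(status: int) -> str:
    """Return the slide's failure category for a status code, or 'Unclassified'."""
    return FAILURE_CATEGORY_BY_STATUS.get(status, "Unclassified")


if __name__ == "__main__":
    for code in (400, 403, 429, 418):
        print(code, classify_status(code))
```

Discovery failures such as name resolution errors usually happen before any HTTP response exists, so they do not appear in a status-code table like this one.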
  • 40. Discover

      ID: 3
      Component / Dependency Interaction: Storage Layer -> Azure Storage
      Failure Short Name: Error 5xx from Azure Storage::ServerAPI
      Failure Description: Azure Storage may respond with ServerBusy or OperationTimedOut when the web role is attempting to read/write data in one of the Storage tables on behalf of a service client.
      Response: Return Error to caller. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

      ID: 2A
      Component / Dependency Interaction: Client API -> Relying Party Suite
      Failure Short Name: Data Encryption Key certificate invalid
      Failure Description: The RPS component may have an invalid/expired DEK certificate.
      Response: Return Authentication error to caller. The only users who will see effects are those using multiple clients, a small share of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently. Recovery requires human intervention.

      ID: 2B
      Component / Dependency Interaction: Client API -> OrgID RPS
      Failure Short Name: No Response from OrgID RPS
      Failure Description: The ClientAPI may not receive a response from the OrgID RPS worker role. Due to the limited number of instances of the OrgID RPS role, a combination of events may take down one instance in a Fault Domain and another in an Update Domain concurrently. This may be coupled with a capacity issue on the remaining instance.
      Response: Return Authentication error to caller. The only users who will see effects are those using multiple clients, a small share of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently.

      ID: 4
      Component / Dependency Interaction: Queue Integration Service -> Queue Service
      Failure Short Name: No Response from Queue Service
      Failure Description: The Queue service may be unresponsive for an extended period of time.
      Response: Buffer locally the first 50 requests for later playback. Discard requests after the buffer reaches 50. Less than 1% of users are both CloudStore and ProductivityClient users, and they are the only ones who would see impact. The Observer monitors in real time the attempts it makes and the responses, via perf counters. The Observer, called by the framework, uses those counters to decide if it's in a healthy state.

      ID: 4
      Component / Dependency Interaction: Queue Integration Service -> Queue Service
      Failure Short Name: Invalid Client Certificate
      Failure Description: The Client certificate on the role instance for use with the Queue Integration service may be invalid/expired, or the Queue service may make a breaking change which invalidates the Client certificate.
      Response: All calls to Queue service would fail. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

      ID: 1B
      Component / Dependency Interaction: Web Service -> Server API
      Failure Short Name: Invalid SSL Certificate
      Failure Description: The SSL certificate for IIS may be invalid/expired when servicing service clients.
      Response: Return SSL error to caller. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

      ID: 1B
      Component / Dependency Interaction: Web Service -> Server API
      Failure Short Name: Latency from Server API
      Failure Description: The Server API may be slow to respond to calls originating from outside the USA because the web service infrastructure's only location is the Midwest.
      Response: Caller will time out, resulting, for example, in blank functionality. 6-10% of calls from Southeast Asia consistently fail. Overall user base affected would be <2%. Monitoring has not caught this problem since it's too transient.

      ID: 7
      Component / Dependency Interaction: Azure Software Load Balancer -> Web Roles
      Failure Short Name: Azure SLB cannot talk to Web Roles::ServerAPI
      Failure Description: The Azure SLB may be unable to communicate with any of the Web Role instances for service clients.
      Response: Error 404 returned to the caller. Probes will detect this error. This has not been seen in production yet. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

      ID: 8
      Component / Dependency Interaction: Azure DNS
      Failure Short Name: Azure DNS Failure::ServerAPI
      Failure Description: The Azure DNS system may fail, resulting in the inability of service clients to resolve the DNS name of the Contoso service.
      Response: Error DNS not found returned to the caller. Probes will detect this. This has not yet been seen in production. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

      ID: 9
      Component / Dependency Interaction: Midwest Datacenter
      Failure Short Name: Midwest Datacenter Outage
      Failure Description: Contoso online service may be completely offline due to an outage of the Midwest Datacenter.
      Response: Contoso will be offline until the Midwest Datacenter service is restored. There is no failover datacenter for Contoso. Outside-in type testing would detect the service failure.
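
The response recorded for the unresponsive Queue service (buffer the first 50 requests locally for later playback, discard the rest) can be sketched as a bounded buffer. The class name `BoundedReplayBuffer`, its methods, and the `send` callback are hypothetical; only the limit of 50 and the discard-when-full behavior come from the slide.

```python
from collections import deque


class BoundedReplayBuffer:
    """Holds up to `capacity` requests while a dependency is down; extra requests are discarded."""

    def __init__(self, capacity: int = 50):
        self._capacity = capacity
        self._pending = deque()
        self.discarded = 0  # count of requests dropped because the buffer was full

    def offer(self, request) -> bool:
        """Buffer the request if there is room; return False if it was discarded."""
        if len(self._pending) >= self._capacity:
            self.discarded += 1
            return False
        self._pending.append(request)
        return True

    def replay(self, send) -> int:
        """Send buffered requests in arrival order; return how many were sent."""
        sent = 0
        while self._pending:
            # Remove a request only after send() returns, so a failure keeps it buffered.
            send(self._pending[0])
            self._pending.popleft()
            sent += 1
        return sent


if __name__ == "__main__":
    buffer = BoundedReplayBuffer()
    accepted = sum(buffer.offer(f"request-{i}") for i in range(60))
    print(f"accepted={accepted} discarded={buffer.discarded}")
    delivered = []
    print(f"replayed={buffer.replay(delivered.append)}")
```

Exposing the discard count is one way to give a monitoring component the kind of counter the slide describes the Observer reading.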
  • 41. Rate
  • 42. Act