Presented in 2013
Microsoft Confidential – Internal Use Only
Show Me the Model!
SCALE
Lifecycle Modeling
• Business focused
• Identify the expected lifecycle of each workload
  • Across the year by month
  • Across the week by day
  • Across the day by hour
  • Special periods as appropriate
    • Holidays
    • Game Days vs. Non-Game Days
• Don’t start with 9s – start with looser terms
  • None, Low, Medium, High, Very High, Highest
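
A lifecycle model of this kind can be written down as plain data before any 9s are discussed. The sketch below is illustrative only: the workload, the periods chosen, and the levels assigned to them are hypothetical, and only the six-level scale comes from the slide.

```python
from enum import IntEnum


class Level(IntEnum):
    """The looser terms from the slide, ordered from least to most demanding."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4
    HIGHEST = 5


# Hypothetical workload: a ticketing site for a sports team.
# Each view of the lifecycle maps a period to an expected level.
ticketing_lifecycle = {
    "by_month": {"Jan": Level.LOW, "Sep": Level.HIGH, "Oct": Level.VERY_HIGH},
    "by_day": {"Mon": Level.MEDIUM, "Sat": Level.VERY_HIGH, "Sun": Level.HIGH},
    "by_hour": {3: Level.NONE, 12: Level.MEDIUM, 19: Level.HIGH},
    "special": {"Game Day": Level.HIGHEST, "Non-Game Day": Level.MEDIUM},
}


def peak_level(lifecycle):
    """Return the most demanding level found anywhere in the lifecycle."""
    return max(level for view in lifecycle.values() for level in view.values())


print(peak_level(ticketing_lifecycle).name)  # HIGHEST
```

Keeping the levels ordered makes it easy to ask simple questions of the model, such as which period drives the peak requirement.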
Availability Modeling
• Translates business lifecycle to 9s
  • What level of uptime must be achieved to meet the needs of the lifecycle model
• Helps rationalize desired SLA vs. ability and cost to deliver
• Guides architecture and technical decisions
  • Guides capacity planning
  • Guides selection of 3rd party services
• Prepares for prioritized resiliency planning
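
To make the translation to 9s concrete, it can help to convert an availability target into the downtime it permits. The following is a minimal sketch of that arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_hours):
    """Minutes of downtime permitted in a period at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target, HOURS_PER_YEAR)
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

At three 9s (99.9%), for example, the budget is about 525.6 minutes, or roughly 8.76 hours, per year. Comparing that budget with the lifecycle model shows whether a desired SLA is realistic for the cost of delivering it.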
Scale Units
• Scale Out vs. Scale Up
• Define units of scale
  • Benefits testing
  • Can be used with automated scale up/down
  • Helps with cost modeling
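
One way to see how a unit of scale supports both capacity planning and cost modeling is to treat it as a fixed bundle with a tested capacity and a known cost. This is a sketch under assumed numbers: the unit's composition, capacity, and cost are hypothetical and do not come from the presentation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """A group of resources that is deployed, tested, and scaled together."""
    name: str
    users_supported: int   # capacity validated by testing one unit
    monthly_cost: float    # cost of running one unit


def units_needed(expected_users, unit):
    """Number of whole scale units required for the expected load."""
    return math.ceil(expected_users / unit.users_supported)


# Hypothetical unit: figures are illustrative only.
unit = ScaleUnit(name="web+cache+storage", users_supported=10_000, monthly_cost=1_500.0)

for users in (8_000, 25_000, 100_000):
    count = units_needed(users, unit)
    print(f"{users} users -> {count} unit(s), cost {count * unit.monthly_cost:.2f} per month")
```

Because load is always rounded up to whole units, the same calculation can feed an automated scale up/down rule.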
RMA (Resilience Modeling and Analysis)
• Identify failure points
  • Component interactions
• Identify failure modes
  • Discovery
  • Incorrectness
  • Auth
  • Limits/Latency
• Assess risk priority
  • Impact and likelihood
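
The risk-priority step can be sketched as a ranking of failure modes by impact and likelihood. The slide names only those two factors, so the 1-to-5 scales, the scores, and the use of a simple product below are assumptions for illustration rather than the method used in the presentation.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    interaction: str   # component interaction where the failure occurs
    category: str      # Discovery, Incorrectness, Auth, or Limits/Latency
    impact: int        # hypothetical scale: 1 (minor) to 5 (severe)
    likelihood: int    # hypothetical scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self):
        # Assumed scoring: a simple product of impact and likelihood.
        return self.impact * self.likelihood


# Scores are made up for the example.
modes = [
    FailureMode("Web Service -> Server API", "Limits/Latency", impact=3, likelihood=4),
    FailureMode("Storage Layer -> Azure Storage", "Limits/Latency", impact=4, likelihood=2),
    FailureMode("Azure DNS", "Discovery", impact=5, likelihood=1),
]

ranked = sorted(modes, key=lambda m: m.risk, reverse=True)
for mode in ranked:
    print(f"{mode.risk:>2}  {mode.category:<15} {mode.interaction}")
```

Sorting by the combined score gives a starting order for resiliency work; a team would normally review the ranking rather than follow it mechanically.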
Pre-work -> Discover -> Rate -> Act
Pre-work
Discover
Discovery
• Caller cannot locate the resource due to configuration errors
  • Configuration source is incorrect
  • Configuration source is missing
  • Configuration source is corrupt
  • Network configuration prevents connection (e.g. ACL, firewall)
• Caller cannot locate the resource due to name resolution errors
  • Name resolution service is not responsive
  • Name resolution service has a missing resource record
  • Name resolution service has a stale or corrupt resource record
Incorrectness
• Caller receives an error because the request is incorrect
  • Protocol violation (e.g. invalid parameters passed by caller, 400 Bad Request)
  • Version mismatch (e.g. N-1 or greater not supported, 405 Method Not Allowed)
• Request does not complete due to corrupt or malformed data
  • Read/Write failed due to resource corruption (e.g. disk, file, db, table, etc.)
  • Data returned to caller is not what was expected (e.g. incorrect record entry)
  • Poison message prevents resource or caller from processing
• Caller receives an error because of bad assumed context
  • Resource is in an invalid state to complete the request (e.g. deleting a directory that is not empty, starting something already started)
  • Non-idempotent transaction errors (e.g. resource already exists)
  • Resource is missing (404 Not Found, Database, Table, Row, File, etc.)
  • Timing is incorrect (e.g. events happen in the wrong order)
Auth
• Caller receives an authentication error
  • Authentication service unavailable
  • Account doesn't exist, account expired, password incorrect
  • Certificate incorrect or expired
• Caller receives an authorization failure
  • Access denied to resource (e.g. 403 Forbidden)
Limits/Latency
• Caller receives no response from the resource, resulting in a timeout or blocking on the caller
  • Timeout even after a successful connect (e.g. process deadlocked)
  • Timeout errors because of resource load (e.g. out of storage, memory, processing)
  • Timeout errors due to the network (e.g. capacity or latency)
  • Requests simply dropped by network or resource
• Caller receives an error related to exceeding limits on the resource
  • Unspecified errors (e.g. 500 Internal Server Error)
  • Metering on resource (e.g. 503 Service Unavailable or 429 Too Many Requests)
  • Resource exhaustion (e.g. insufficient storage, memory, processing, or queue length)
  • Sharing contention (e.g. sharing of resource with other services, components, or maintenance activities)
  • Unbounded, unconstrained requests or responses (e.g. expected one row but returned one million rows)
  • Request flooding (e.g. DDoS, malicious or self-inflicted)
• Caller receives a success response but at a very slow rate, causing queue lengths on the caller to exceed their limits
  • Heavy loads on resource can cause slow response times
  • Network congestion or latency
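
As a rough illustration of how the four categories could be applied to observed call outcomes, the sketch below maps the HTTP status codes named in the lists above to their category. The function name and its parameters are hypothetical, and outcomes not mentioned on the slides are deliberately left unclassified.

```python
# Status codes and categories exactly as they appear in the lists above.
STATUS_TO_CATEGORY = {
    400: "Incorrectness",   # protocol violation
    404: "Incorrectness",   # resource is missing
    405: "Incorrectness",   # version mismatch
    403: "Auth",            # authorization failure
    429: "Limits/Latency",  # metering on resource
    500: "Limits/Latency",  # unspecified error
    503: "Limits/Latency",  # metering on resource
}


def categorize(status_code=None, timed_out=False, name_resolved=True):
    """Map an observed call outcome to a failure category, or None if unknown."""
    if not name_resolved:
        return "Discovery"       # caller could not locate the resource
    if timed_out:
        return "Limits/Latency"  # no response from the resource
    return STATUS_TO_CATEGORY.get(status_code)


print(categorize(status_code=403))           # Auth
print(categorize(timed_out=True))            # Limits/Latency
print(categorize(name_resolved=False))       # Discovery
```

A lookup like this is only a first pass; the same status code can have different root causes, which is why the Discover phase walks each component interaction individually.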
Discover
ID: 3
Component / Dependency Interaction: Storage Layer -> Azure Storage
Failure Short Name: Error 5xx from Azure Storage::ServerAPI
Failure Description: Azure Storage may respond with ServerBusy or OperationTimedOut when the web role is attempting to read/write data in one of the Storage tables on behalf of a service client.
Response: Return Error to caller. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 2A
Component / Dependency Interaction: Client API -> Relying Party Suite
Failure Short Name: Data Encryption Key certificate invalid
Failure Description: The RPS component may have an invalid/expired DEK certificate.
Response: Return Authentication error to caller. The only users who will see effects of this are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently. Recovery requires human intervention.

ID: 2B
Component / Dependency Interaction: Client API -> OrgID RPS
Failure Short Name: No Response from OrgID RPS
Failure Description: The ClientAPI may not receive a response from the OrgID RPS worker role. Due to the limited number of instances of the OrgID RPS role, there may be a combination of events that take down one instance in a Fault Domain and another in an Update Domain concurrently. This may be coupled with a capacity issue on the remaining instance.
Response: Return Authentication error to caller. The only users who will see effects of this are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently.

ID: 4
Component / Dependency Interaction: Queue Integration Service -> Queue Service
Failure Short Name: No Response from Queue Service
Failure Description: The Queue service may be unresponsive for an extended period of time.
Response: Buffer the first 50 requests locally for later playback. Discard requests after the buffer reaches 50. Fewer than 1% of users are both CloudStore and ProductivityClient users, who are the only ones that would see impact. An observer monitors, in real time, the attempts it makes and the responses via perf counters. The observer, called by the framework, uses those counters to decide whether it is in a healthy state.

ID: 4
Component / Dependency Interaction: Queue Integration Service -> Queue Service
Failure Short Name: Invalid Client Certificate
Failure Description: The Client certificate on the role instance for use with the Queue Integration service may be invalid/expired, or the Queue service may make a breaking change which invalidates the Client certificate.
Response: All calls to the Queue service would fail. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 1B
Component / Dependency Interaction: Web Service -> Server API
Failure Short Name: Invalid SSL Certificate
Failure Description: The SSL certificate for IIS may be invalid/expired when servicing service clients.
Response: Return SSL error to caller. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 1B
Component / Dependency Interaction: Web Service -> Server API
Failure Short Name: Latency from Server API
Failure Description: The Server API may be slow to respond to calls originating from outside the USA because the web service infrastructure's only location is the Midwest.
Response: Caller will time out, resulting in blank functionality, for example. 6-10% of calls from Southeast Asia consistently fail. The affected share of the overall user base would be <2%. Monitoring has not caught this problem since it's too transient.

ID: 7
Component / Dependency Interaction: Azure Software Load Balancer -> Web Roles
Failure Short Name: Azure SLB cannot talk to Web Roles::ServerAPI
Failure Description: The Azure SLB may be unable to communicate with any of the Web Role instances for service clients.
Response: Error 404 returned to the caller. Probes will detect this error. This has not been seen in production yet. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 8
Component / Dependency Interaction: Azure DNS
Failure Short Name: Azure DNS Failure::ServerAPI
Failure Description: The Azure DNS system may fail, resulting in the inability of service clients to resolve the DNS name of the Contoso service.
Response: A DNS-not-found error is returned to the caller. Probes will detect this. This has not yet been seen in production. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 9
Component / Dependency Interaction: Midwest Datacenter
Failure Short Name: Midwest Datacenter Outage
Failure Description: The Contoso online service may be completely offline due to an outage of the Midwest Datacenter.
Response: Contoso will be offline until the Midwest Datacenter service is restored. There is no failover datacenter for Contoso. Outside-in type testing would detect the service failure.
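
The response recorded for the unresponsive Queue service (buffer the first 50 requests, discard the rest, play back later) can be sketched as a small bounded buffer. This is an illustrative sketch, not the service's actual implementation; the class name, method names, and the request shape are hypothetical.

```python
from collections import deque


class BoundedReplayBuffer:
    """Holds the first `capacity` requests; later requests are discarded."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self._pending = deque()
        self.discarded = 0

    def add(self, request):
        """Buffer the request if there is room; return True if it was kept."""
        if len(self._pending) >= self.capacity:
            self.discarded += 1
            return False
        self._pending.append(request)
        return True

    def replay(self, send):
        """Send buffered requests in arrival order once the service is back."""
        sent = 0
        while self._pending:
            send(self._pending[0])    # if send raises, the request stays buffered
            self._pending.popleft()
            sent += 1
        return sent


buffer = BoundedReplayBuffer(capacity=50)
for i in range(60):
    buffer.add({"id": i})

delivered = []
print(buffer.replay(delivered.append), buffer.discarded)  # 50 10
```

Counting discarded requests gives monitoring something concrete to report while the dependency is down.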
Rate
Act
Fail safe modeling for cloud services and applications
Fail safe modeling for cloud services and applications

More Related Content

Viewers also liked

Unit 1 review
Unit 1 reviewUnit 1 review
Unit 1 review
cblockus
 
20111011 Geek Girls - Innovation
20111011 Geek Girls - Innovation20111011 Geek Girls - Innovation
20111011 Geek Girls - Innovation
FINN.no
 

Viewers also liked (14)

Autorizacoes para levantamentos_hidrograficos_2015-1
Autorizacoes para levantamentos_hidrograficos_2015-1Autorizacoes para levantamentos_hidrograficos_2015-1
Autorizacoes para levantamentos_hidrograficos_2015-1
 
Unit 1 review
Unit 1 reviewUnit 1 review
Unit 1 review
 
Modular development
Modular developmentModular development
Modular development
 
The Digital Advisor - Using technology to improve an advisory sales process f...
The Digital Advisor - Using technology to improve an advisory sales process f...The Digital Advisor - Using technology to improve an advisory sales process f...
The Digital Advisor - Using technology to improve an advisory sales process f...
 
Formularis - Post/Redirect/Get (ca)
Formularis - Post/Redirect/Get (ca)Formularis - Post/Redirect/Get (ca)
Formularis - Post/Redirect/Get (ca)
 
The Thinking behind BEM
The Thinking behind BEMThe Thinking behind BEM
The Thinking behind BEM
 
3 g
3 g3 g
3 g
 
Windows
WindowsWindows
Windows
 
Mobile web-debug
Mobile web-debugMobile web-debug
Mobile web-debug
 
Du vil vel ikke mamma noe vondt?
Du vil vel ikke mamma noe vondt?Du vil vel ikke mamma noe vondt?
Du vil vel ikke mamma noe vondt?
 
20111011 Geek Girls - Innovation
20111011 Geek Girls - Innovation20111011 Geek Girls - Innovation
20111011 Geek Girls - Innovation
 
Energytransferlab
EnergytransferlabEnergytransferlab
Energytransferlab
 
Session 2 - Q&A
Session 2 - Q&ASession 2 - Q&A
Session 2 - Q&A
 
Update on the UN System of Environmental-Economic Accounting
Update on the UN System of Environmental-Economic AccountingUpdate on the UN System of Environmental-Economic Accounting
Update on the UN System of Environmental-Economic Accounting
 

Similar to Fail safe modeling for cloud services and applications

SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
sqlserver.co.il
 

Similar to Fail safe modeling for cloud services and applications (20)

Key to optimal end user experience
Key to optimal end user experienceKey to optimal end user experience
Key to optimal end user experience
 
Azure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesAzure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challenges
 
Black and Blue APIs: Attacker's and Defender's View of API Vulnerabilities
Black and Blue APIs: Attacker's and Defender's View of API VulnerabilitiesBlack and Blue APIs: Attacker's and Defender's View of API Vulnerabilities
Black and Blue APIs: Attacker's and Defender's View of API Vulnerabilities
 
SQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should KnowSQL Server Wait Types Everyone Should Know
SQL Server Wait Types Everyone Should Know
 
Geek Sync I CSI for SQL: Learn to be a SQL Sleuth
Geek Sync I CSI for SQL: Learn to be a SQL SleuthGeek Sync I CSI for SQL: Learn to be a SQL Sleuth
Geek Sync I CSI for SQL: Learn to be a SQL Sleuth
 
eduroam diagnostics in NTLR, IdPs and SPs
eduroam diagnostics in NTLR, IdPs and SPseduroam diagnostics in NTLR, IdPs and SPs
eduroam diagnostics in NTLR, IdPs and SPs
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservices
 
Visibility-from web application interface to the database
Visibility-from web application interface to the databaseVisibility-from web application interface to the database
Visibility-from web application interface to the database
 
Software Performance
Software Performance Software Performance
Software Performance
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
 
BeJUG JAX-RS Event
BeJUG JAX-RS EventBeJUG JAX-RS Event
BeJUG JAX-RS Event
 
APIs, STOP Polling, lets go Streaming
APIs, STOP Polling, lets go StreamingAPIs, STOP Polling, lets go Streaming
APIs, STOP Polling, lets go Streaming
 
L12 Session State and Distributation Strategies
L12 Session State and Distributation StrategiesL12 Session State and Distributation Strategies
L12 Session State and Distributation Strategies
 
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
 
L20 Scalability
L20 ScalabilityL20 Scalability
L20 Scalability
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
 
Unit 2 oracle9i
Unit 2  oracle9i Unit 2  oracle9i
Unit 2 oracle9i
 
Database failover from client perspective
Database failover from client perspectiveDatabase failover from client perspective
Database failover from client perspective
 
BITM3730 11-1.pptx
BITM3730 11-1.pptxBITM3730 11-1.pptx
BITM3730 11-1.pptx
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 

More from Marc Mercuri

More from Marc Mercuri (9)

Architecting world class azure resource manager templates
Architecting world class azure resource manager templatesArchitecting world class azure resource manager templates
Architecting world class azure resource manager templates
 
Architecting Solutions That Span Private and Public Clouds
Architecting Solutions That Span Private and Public CloudsArchitecting Solutions That Span Private and Public Clouds
Architecting Solutions That Span Private and Public Clouds
 
Architecting fail safe data services
Architecting fail safe data servicesArchitecting fail safe data services
Architecting fail safe data services
 
Architecting with a 'cloud first' mindset
Architecting  with a 'cloud first' mindsetArchitecting  with a 'cloud first' mindset
Architecting with a 'cloud first' mindset
 
Services symposium 2013 failsafe in 15 minutes
Services symposium 2013   failsafe in 15 minutesServices symposium 2013   failsafe in 15 minutes
Services symposium 2013 failsafe in 15 minutes
 
Predictive maintenance - Architecting a Solution with Devices, Services, Big ...
Predictive maintenance - Architecting a Solution with Devices, Services, Big ...Predictive maintenance - Architecting a Solution with Devices, Services, Big ...
Predictive maintenance - Architecting a Solution with Devices, Services, Big ...
 
Internet of Things: Opportunities and Architectures
Internet of Things: Opportunities and ArchitecturesInternet of Things: Opportunities and Architectures
Internet of Things: Opportunities and Architectures
 
FailSafe IaaS
FailSafe IaaSFailSafe IaaS
FailSafe IaaS
 
Failsafe 1 hour 2013
Failsafe 1 hour   2013Failsafe 1 hour   2013
Failsafe 1 hour 2013
 

Recently uploaded

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 

Recently uploaded (20)

%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 

Fail safe modeling for cloud services and applications

  • 2.
  • 3. ^
  • 5.
  • 6.
  • 7.
  • 8.
  • 9. Microsoft Confidential – Internal Use Only
  • 10. Microsoft Confidential – Internal Use Only
  • 11. Microsoft Confidential – Internal Use Only
  • 12.
  • 13.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. Microsoft Confidential – Internal Use Only
  • 24.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30. • Business focused • Identify the expected lifecycle of each workload • Across the year by month • Across the week by day • Across the day by hour • Special periods as appropriate • Holidays • Game Days vs. Non-Game Days • Don’t start with 9s – start with looser terms • None, Low, Medium, High, Very High, Highest Lifecycle Modeling
  • 31.
  • 32. • Translates business lifecycle to 9s • What level of uptime must be achieved to meet the needs of the lifecycle model • Helps rationalize desired SLA vs. ability and cost to deliver • Guides architecture and technical decisions • Guides capacity planning • Guides selection of 3rd party services • Prepares for prioritized resiliency planning Availability Modeling
  • 33.
  • 34. • Scale out vs. Scale Up • Define units of scale • Benefits testing • Can be used with automated scale up/down • Helps with cost modeling Scale Units
  • 35.
  • 36. • Identify failure points • Component interactions • Identify failure modes • Discovery • Incorrectness • Auth • Limits/Latency • Assess risk priority • Impact and likelihood RMA (Resilience Modeling and Analysis)
  • 39. Discover
    Discovery
    ▪ Caller cannot locate the resource due to configuration errors
      o Configuration source is incorrect
      o Configuration source is missing
      o Configuration source is corrupt
      o Network configuration prevents connection (e.g. ACL, firewall)
    ▪ Caller cannot locate the resource due to name resolution errors
      o Name resolution service is not responsive
      o Name resolution service has a missing resource record
      o Name resolution service has a stale or corrupt resource record
    Incorrectness
    ▪ Caller receives an error because the request is incorrect
      o Protocol violation (e.g. invalid parameters passed by caller, 400 Bad Request)
      o Version mismatch (e.g. N-1 or greater not supported, 405 Method Not Allowed)
    ▪ Request does not complete due to corrupt or malformed data
      o Read/Write failed due to resource corruption (e.g. disk, file, db, table, etc.)
      o Data returned to caller is not what was expected (e.g. incorrect record entry)
      o Poison message prevents resource or caller from processing
    ▪ Caller receives an error because of bad assumed context
      o Resource is in an invalid state to complete the request (e.g. del DIR that is not empty, start something already started)
      o Non-idempotent transaction errors (e.g. resource already exists)
      o Resource is missing (404 Not Found, Database, Table, Row, File, etc.)
      o Timing is incorrect (e.g. events happen in the wrong order)
    Auth
    ▪ Caller receives an authentication error
      o Authentication service unavailable
      o Account doesn't exist, account expired, password incorrect
      o Certificate incorrect or expired
    ▪ Caller receives an authorization failure
      o Access denied to resource (e.g. 403 Forbidden)
    Limits/Latency
    ▪ Caller receives no response from the resource, resulting in a timeout or blocking on the caller
      o Timeout even after successful connect (e.g. process deadlocked)
      o Timeout errors because of resource load (e.g. out of storage, memory, processing)
      o Timeout errors due to network (e.g. capacity or latency)
      o Requests simply dropped by network or resource
    ▪ Caller receives an error related to exceeding limits on the resource
      o Unspecified errors (e.g. 500 Internal Server Error)
      o Metering on resource (e.g. 503 Service Unavailable or 429 Too Many Requests)
      o Resource exhaustion (e.g. insufficient storage, memory, processing, or queue length)
      o Sharing contention (e.g. sharing of resource with other services, components, or maintenance activities)
      o Unbounded, unconstrained requests or responses (e.g. expected one row but returned one million rows)
      o Request flooding (e.g. DDoS, malicious or self-inflicted)
    ▪ Caller receives a success response but at a very slow rate, causing queue lengths on the caller to exceed their limits
      o Heavy loads on resource can cause slow response times
      o Network congestion or latency
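    One way to put the failure categories above to work during the Discover phase is to tag each observed failure with its category. The sketch below is only an illustration: the function name `classify_failure`, its parameters, and the "Unclassified" fallback are assumptions, not part of the deck. The table contains only the HTTP status codes that the slide itself names as examples.

```python
from typing import Optional

# Status codes named on the slide, keyed to the category they illustrate.
# Anything not listed here is deliberately left unclassified.
STATUS_TO_CATEGORY = {
    400: "Incorrectness",   # protocol violation
    404: "Incorrectness",   # resource is missing
    405: "Incorrectness",   # version mismatch (the slide's example)
    403: "Auth",            # access denied to resource
    429: "Limits/Latency",  # metering on resource
    500: "Limits/Latency",  # unspecified errors
    503: "Limits/Latency",  # metering on resource
}


def classify_failure(
    status: Optional[int],
    timed_out: bool = False,
    name_resolved: bool = True,
) -> str:
    """Return the failure category for one failed call (hypothetical helper)."""
    if not name_resolved:
        # The caller could not locate the resource at all.
        return "Discovery"
    if timed_out or status is None:
        # No response from the resource: timeout or dropped request.
        return "Limits/Latency"
    return STATUS_TO_CATEGORY.get(status, "Unclassified")
```

    For example, `classify_failure(403)` yields "Auth", and a call that never resolved the service name yields "Discovery" regardless of status.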
  • 40. Discover
    ID: 3
    Component / Dependency Interaction: Storage Layer -> Azure Storage
    Failure Short Name: Error 5xx from Azure Storage::ServerAPI
    Failure Description: Azure Storage may respond with ServerBusy or OperationTimedOut when the web role is attempting to read/write data in one of the Storage tables on behalf of a service client.
    Response: Return Error to caller. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

    ID: 2A
    Component / Dependency Interaction: Client API -> Relying Party Suite
    Failure Short Name: Data Encryption Key certificate invalid
    Failure Description: The RPS component may have an invalid/expired DEK certificate.
    Response: Return Authentication error to caller. The only users who will see effects are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently. Recovery requires human intervention.

    ID: 2B
    Component / Dependency Interaction: Client API -> OrgID RPS
    Failure Short Name: No Response from OrgID RPS
    Failure Description: The ClientAPI may not receive a response from the OrgID RPS worker role. Due to the limited number of instances of the OrgID RPS role, there may be a combination of events that take down one instance in a Fault Domain and another in an Update Domain concurrently. This may be coupled with a capacity issue on the remaining instance.
    Response: Return Authentication error to caller. The only users who will see effects are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently.

    ID: 4
    Component / Dependency Interaction: Queue Integration Service -> Queue Service
    Failure Short Name: No Response from Queue Service
    Failure Description: The Queue service may be unresponsive for an extended period of time.
    Response: Buffer locally the first 50 requests for later playback. Discard requests after the buffer reaches 50. Less than 1% of users are both CloudStore and ProductivityClient users, which are the only ones that would see impact. Observer monitors in real time the attempts it makes and the responses via perf counters. Observer called by framework uses those counters to decide if it's in a healthy state.

    ID: 4
    Component / Dependency Interaction: Queue Integration Service -> Queue Service
    Failure Short Name: Invalid Client Certificate
    Failure Description: The Client certificate on the role instance for use with the Queue Integration service may be invalid/expired, or the Queue service may make a breaking change which invalidates the Client certificate.
    Response: All calls to Queue service would fail. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

    ID: 1B
    Component / Dependency Interaction: Web Service -> Server API
    Failure Short Name: Invalid SSL Certificate
    Failure Description: The SSL certificate for IIS may be invalid/expired when servicing service clients.
    Response: Return SSL error to caller. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

    ID: 1B
    Component / Dependency Interaction: Web Service -> Server API
    Failure Short Name: Latency from Server API
    Failure Description: The Server API may be slow to respond to calls originating from outside the USA because the web service infrastructure's only location is the Midwest.
    Response: Caller will time out, resulting in, for example, blank functionality. 6-10% of calls from Southeast Asia consistently fail. Overall user base would be <2%. Monitoring has not caught this problem since it's too transient.

    ID: 7
    Component / Dependency Interaction: Azure Software Load Balancer -> Web Roles
    Failure Short Name: Azure SLB cannot talk to Web Roles::ServerAPI
    Failure Description: The Azure SLB may be unable to communicate with any of the Web Role instances for service clients.
    Response: Error 404 returned to the caller. Probes will detect this error. This has not been seen in production yet. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

    ID: 8
    Component / Dependency Interaction: Azure DNS
    Failure Short Name: Azure DNS Failure::ServerAPI
    Failure Description: The Azure DNS system may fail, resulting in the inability of service clients to resolve the DNS name of the Contoso service.
    Response: Error DNS not found returned to the caller. Probes will detect. This has not yet been seen in production. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

    ID: 9
    Component / Dependency Interaction: Midwest Datacenter
    Failure Short Name: Midwest Datacenter Outage
    Failure Description: Contoso online service may be completely offline due to an outage of the Midwest Datacenter.
    Response: Contoso will be offline until the Midwest Datacenter service is restored. There is no failover datacenter for Contoso. Outside-in type testing would detect the service failure.
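    The response recorded for "No Response from Queue Service" (buffer the first 50 requests locally for later playback, discard the rest) can be sketched as a small bounded buffer. This is a minimal illustration under stated assumptions, not the service's actual code: the class name `BoundedReplayBuffer`, its methods, and the `send` callback are all hypothetical.

```python
from collections import deque
from typing import Any, Callable, Deque


class BoundedReplayBuffer:
    """Hold up to `capacity` requests while a dependency is unresponsive."""

    def __init__(self, capacity: int = 50) -> None:
        self.capacity = capacity
        self._pending: Deque[Any] = deque()
        self.discarded = 0  # count of requests dropped once the buffer was full

    def __len__(self) -> int:
        return len(self._pending)

    def offer(self, request: Any) -> bool:
        """Buffer the request, or discard it if the buffer is already full."""
        if len(self._pending) >= self.capacity:
            self.discarded += 1
            return False
        self._pending.append(request)
        return True

    def replay(self, send: Callable[[Any], None]) -> int:
        """Send buffered requests oldest-first; return how many were sent.

        A request is removed only after `send` returns, so if `send` raises,
        that request and everything behind it stay buffered for a later retry.
        """
        sent = 0
        while self._pending:
            send(self._pending[0])
            self._pending.popleft()
            sent += 1
        return sent
```

    Keeping the first 50 and dropping later arrivals matches the wording of the table row. Exposing the `discarded` counter is an added assumption, meant to make the data loss visible to monitoring in the same spirit as the perf counters the row mentions.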
  • 41. Rate
  • 42. Act