When cloud services and applications fail at scale, there's typically a common source of blame: architecture. Not understanding the lifecycle, availability, capacity and failure modes of your services can - and often does - lead you to fail at scale. Modeling these areas takes a modest amount of time up front and can help you avoid weeks or even months of intensive firefighting later. This session covers key considerations and approaches to modeling and highlights a straightforward set of templates developed by Microsoft Consulting Services and Trustworthy Computing.
30. Lifecycle Modeling
• Business focused
• Identify the expected lifecycle of each workload
  • Across the year by month
  • Across the week by day
  • Across the day by hour
• Special periods as appropriate
  • Holidays
  • Game Days vs. Non-Game Days
• Don’t start with 9s – start with looser terms
  • None, Low, Medium, High, Very High, Highest
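A lifecycle model of this kind can be captured as plain data. The sketch below is only an illustration: the six levels come from the slide, but the workload values, the `expected_level` helper, and the rule that the most demanding view wins are all assumptions made for the example, not part of the templates.

```python
from datetime import datetime
from enum import IntEnum


class Level(IntEnum):
    """The looser business terms from the slide, ordered by demand."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4
    HIGHEST = 5


# Hypothetical lifecycle for one workload (all values are made up).
DEFAULT = Level.LOW
BY_MONTH = {12: Level.VERY_HIGH}                      # across the year by month
BY_WEEKDAY = {d: Level.MEDIUM for d in range(0, 5)}   # Monday=0 .. Friday=4
BY_HOUR = {h: Level.HIGH for h in range(9, 17)}       # 09:00-16:59
SPECIAL_PERIODS = {(12, 25): Level.HIGHEST}           # (month, day) holidays


def expected_level(when: datetime) -> Level:
    """Return the expected demand level for a point in time.

    Assumption: special periods override everything else; otherwise the
    most demanding of the month, weekday, and hour views applies.
    """
    special = SPECIAL_PERIODS.get((when.month, when.day))
    if special is not None:
        return special
    return max(
        BY_MONTH.get(when.month, DEFAULT),
        BY_WEEKDAY.get(when.weekday(), DEFAULT),
        BY_HOUR.get(when.hour, DEFAULT),
    )


if __name__ == "__main__":
    print(expected_level(datetime(2024, 6, 5, 10)).name)
```

Keeping the levels as an ordered enum means the business conversation can stay in words such as "High" while the model remains comparable and easy to translate into uptime targets later.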
32. Availability Modeling
• Translates business lifecycle to 9s
  • What level of uptime must be achieved to meet the needs of the lifecycle model
• Helps rationalize desired SLA vs. ability and cost to deliver
• Guides architecture and technical decisions
• Guides capacity planning
• Guides selection of 3rd party services
• Prepares for prioritized resiliency planning
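The translation from 9s to concrete numbers is simple arithmetic, and seeing it helps when weighing a desired SLA against the cost to deliver it. The sketch below is illustrative only: the function names are hypothetical, and the composite calculation assumes serial dependencies that fail independently, which real services may not satisfy.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at a given availability."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def composite_availability(percentages) -> float:
    """Availability of a chain of dependencies that must all be up.

    Assumption: failures are independent, so availabilities multiply.
    """
    result = 1.0
    for pct in percentages:
        result *= pct / 100
    return result * 100


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines}% allows {minutes:.1f} minutes of downtime per year")
    # A service at 99.95% that depends on a 3rd party service at 99.9%
    print(f"composite: {composite_availability([99.95, 99.9]):.3f}%")
```

Three 9s works out to roughly 8.8 hours of downtime a year, and chaining a 99.95% service to a 99.9% dependency yields less than either figure alone, which is why dependency selection belongs in the availability model.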
34. Scale Units
• Scale out vs. Scale Up
• Define units of scale
• Benefits testing
• Can be used with automated scale up/down
• Helps with cost modeling
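Once a unit of scale has been defined and load-tested, capacity and cost questions reduce to counting units. The following is a minimal sketch under stated assumptions: the `ScaleUnit` fields, the 20% headroom default, and the example figures are hypothetical and would come from your own testing and pricing.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """A hypothetical unit of scale, e.g. a fixed group of roles plus storage."""
    name: str
    max_requests_per_sec: int  # capacity measured by load-testing one unit
    monthly_cost: float


def units_needed(peak_rps: float, unit: ScaleUnit, headroom: float = 0.2) -> int:
    """Number of whole units required to serve the peak with spare headroom."""
    required = peak_rps * (1 + headroom)
    return max(1, math.ceil(required / unit.max_requests_per_sec))


def monthly_cost(peak_rps: float, unit: ScaleUnit, headroom: float = 0.2) -> float:
    """Cost model: scale-out cost grows in steps of one unit."""
    return units_needed(peak_rps, unit, headroom) * unit.monthly_cost


if __name__ == "__main__":
    web_unit = ScaleUnit("web-unit", max_requests_per_sec=500, monthly_cost=1200.0)
    print(units_needed(1800, web_unit), monthly_cost(1800, web_unit))
```

Because a unit is tested as a whole, the same number can drive an automated scale up/down rule and the cost forecast.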
39. Discover
Discovery
Caller cannot locate the resource due to configuration errors
o Configuration source is incorrect
o Configuration source is missing
o Configuration source is corrupt
o Network configuration prevents connection (e.g. ACL, firewall)
Caller cannot locate the resource due to name resolution errors
o Name resolution service is not responsive
o Name resolution service has a missing resource record
o Name resolution has a stale or corrupt resource record
Incorrectness
Caller receives an error because the request is incorrect.
o Protocol violation (e.g. invalid parameters passed by caller, 400 Bad Request)
o Version mismatch (e.g. N-1 or greater not supported, 405 Method Not Allowed)
Request does not complete due to corrupt or malformed data
o Read/Write failed due to resource corruption (e.g. disk, file, db, table, etc.)
o Data returned to caller is not what was expected (e.g. incorrect record entry)
o Poison message prevents resource or caller from processing
Caller receives an error because of bad assumed context
o Resource is in invalid state to complete request (e.g. del DIR that is not empty, start something already started)
o Non-idempotent transaction errors (e.g. resource already exists)
o Resource is missing (404 Not Found, Database, Table, Row, File, etc.)
o Timing is incorrect (e.g. events happen in the wrong order)
Auth
Caller receives an authentication error
o Authentication service unavailable
o Account doesn't exist, account expired, password incorrect
o Certificate incorrect or expired
Caller receives an authorization failure
o Access denied to resource (e.g. 403 Forbidden)
Limits/Latency
Caller receives no response from the resource resulting in timeout or blocking on caller
o Time out even after successful connect (e.g. process deadlocked)
o Time out errors because of resource load (e.g. out of storage, memory, processing)
o Time out errors due to network (e.g. capacity or latency)
o Requests simply dropped by network or resource
Caller receives an error related to exceeding limits on the resource
o Unspecified errors (e.g. 500 Internal Server Error)
o Metering on resource (e.g. 503 Service Unavailable or 429 Too Many Requests)
o Resource exhaustion (e.g. insufficient storage, memory, processing, or queue length)
o Sharing contention (e.g. sharing of resource with other services, components, or maintenance activities)
o Unbounded, unconstrained requests or responses (e.g. expected one row but returned one million rows)
o Request flooding (e.g. DDoS, malicious or self-inflicted)
Caller receives a success response but at a very slow rate, causing queue lengths on the caller to exceed their limits
o Heavy loads on resource can cause slow response times
o Network congestion or latency
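The four categories above can double as a triage vocabulary in code. The sketch below maps only the HTTP status codes named in the list to their category, plus two standard-library exceptions as stand-ins for name resolution failures and timeouts; the `FailureCategory` enum and `classify` helper are hypothetical names, and a real service would extend the mapping to its own protocols.

```python
import socket
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    DISCOVERY = "Discovery"
    INCORRECTNESS = "Incorrectness"
    AUTH = "Auth"
    LIMITS_LATENCY = "Limits/Latency"


# Only the status codes that the category list mentions.
STATUS_TO_CATEGORY = {
    400: FailureCategory.INCORRECTNESS,   # protocol violation
    404: FailureCategory.INCORRECTNESS,   # resource is missing
    405: FailureCategory.INCORRECTNESS,   # version mismatch
    403: FailureCategory.AUTH,            # access denied
    429: FailureCategory.LIMITS_LATENCY,  # metering
    500: FailureCategory.LIMITS_LATENCY,  # unspecified error
    503: FailureCategory.LIMITS_LATENCY,  # metering
}


def classify(status: Optional[int] = None,
             error: Optional[BaseException] = None) -> Optional[FailureCategory]:
    """Return the failure category, or None if the input is not recognized."""
    if isinstance(error, socket.gaierror):   # name resolution failed
        return FailureCategory.DISCOVERY
    if isinstance(error, TimeoutError):      # no response in time
        return FailureCategory.LIMITS_LATENCY
    if status is not None:
        return STATUS_TO_CATEGORY.get(status)
    return None


if __name__ == "__main__":
    print(classify(status=429))
    print(classify(error=socket.gaierror("name not resolved")))
```

Tagging observed errors this way makes it easier to check that each category has at least one entry per dependency when filling in the failure-mode table.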
40. Discover
ID: 3
Component / Dependency Interaction: Storage Layer -> Azure Storage
Failure Short Name: Error 5xx from Azure Storage::ServerAPI
Failure Description: Azure Storage may respond with ServerBusy or OperationTimedOut when the web role is attempting to read/write data in one of the Storage tables on behalf of a service client.
Response: Return Error to caller. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 2A
Component / Dependency Interaction: Client API -> Relying Party Suite
Failure Short Name: Data Encryption Key certificate invalid
Failure Description: The RPS component may have an invalid/expired DEK certificate.
Response: Return Authentication error to caller. The only users that will see effects of this are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently. Recovery requires human intervention.

ID: 2B
Component / Dependency Interaction: Client API -> OrgID RPS
Failure Short Name: No Response from OrgID RPS
Failure Description: The ClientAPI may not receive a response from the OrgID RPS worker role. Due to the limited number of instances of the OrgID RPS role, there may be a combination of events that take down one instance in a Fault Domain and another in an Update Domain concurrently. This may be coupled with a capacity issue on the remaining instance.
Response: Return Authentication error to caller. The only users that will see effects of this are those using multiple clients, which is a small number of users (<2%). Those affected users will be missing data. Probes are pinging for this very frequently.

ID: 4
Component / Dependency Interaction: Queue Integration Service -> Queue Service
Failure Short Name: No Response from Queue Service
Failure Description: The Queue service may be unresponsive for an extended period of time.
Response: Buffer locally the first 50 requests for later playback. Discard requests after the buffer reaches 50. Less than 1% of users are both CloudStore and ProductivityClient users, which are the only ones that would see impact. The Observer monitors in real time the attempts it makes and the responses via perf counters. The Observer, called by the framework, uses those counters to decide if it's in a healthy state.

ID: 4
Component / Dependency Interaction: Queue Integration Service -> Queue Service
Failure Short Name: Invalid Client Certificate
Failure Description: The Client certificate on the role instance for use with the Queue Integration service may be invalid/expired, or the Queue service may make a breaking change which invalidates the Client certificate.
Response: All calls to Queue service would fail. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 1B
Component / Dependency Interaction: Web Service -> Server API
Failure Short Name: Invalid SSL Certificate
Failure Description: The SSL certificate for IIS may be invalid/expired when servicing service clients.
Response: Return SSL error to caller. Monitoring probes from other datacenters will pick this up. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 1B
Component / Dependency Interaction: Web Service -> Server API
Failure Short Name: Latency from Server API
Failure Description: The Server API may be slow to respond to calls originating from outside the USA because the web service infrastructure's only location is the Midwest.
Response: Caller will time out, resulting in blank functionality, for example. 6-10% of calls from Southeast Asia consistently fail. Overall user base would be <2%. Monitoring has not caught this problem since it's too transient.

ID: 7
Component / Dependency Interaction: Azure Software Load Balancer -> Web Roles
Failure Short Name: Azure SLB cannot talk to Web Roles::ServerAPI
Failure Description: The Azure SLB may be unable to communicate with any of the Web Role instances for service clients.
Response: Error 404 returned to the caller. Probes will detect this error. This has not been seen in production yet. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 8
Component / Dependency Interaction: Azure DNS
Failure Short Name: Azure DNS Failure::ServerAPI
Failure Description: The Azure DNS system may fail, resulting in the inability of service clients to resolve the DNS name of the Contoso service.
Response: Error DNS not found returned to the caller. Probes will detect. This has not yet been seen in production. Service clients have no cached copy of data, so the functionality will be blank for the user on Read and no data can be saved via Write.

ID: 9
Component / Dependency Interaction: Midwest Datacenter
Failure Short Name: Midwest Datacenter Outage
Failure Description: Contoso online service may be completely offline due to an outage of the Midwest Datacenter.
Response: Contoso will be offline until the Midwest Datacenter service is restored. There is no failover datacenter for Contoso. Outside-in type testing would detect the service failure.
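The response recorded for an unresponsive Queue Service (buffer the first 50 requests for later playback, discard the rest) can be sketched as a small bounded buffer. This is an illustration of that policy only, not the service's actual code: the class name, the `send` callback, and the discard counter are assumptions for the example.

```python
from collections import deque


class BoundedReplayBuffer:
    """Keeps the first `capacity` requests while a dependency is down."""

    def __init__(self, capacity: int = 50):
        self._capacity = capacity
        self._pending = deque()
        self.discarded = 0  # could feed a perf counter for health monitoring

    def offer(self, request) -> bool:
        """Buffer the request; return False if it was discarded."""
        if len(self._pending) >= self._capacity:
            self.discarded += 1
            return False
        self._pending.append(request)
        return True

    def replay(self, send) -> int:
        """Play back buffered requests in arrival order via `send`.

        A request is removed only after `send` returns, so if `send`
        raises, that request and the ones behind it stay buffered.
        """
        sent = 0
        while self._pending:
            send(self._pending[0])
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    buffer = BoundedReplayBuffer(capacity=50)
    for i in range(60):
        buffer.offer({"id": i})
    print(len(buffer), buffer.discarded)
    print(buffer.replay(lambda request: None))
```

Writing the response down as code makes its consequences explicit: once the buffer is full, newer requests are the ones lost, which is a decision the business owner should confirm.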