This document discusses Office 365 service availability and reliability. It provides charts showing over 99.9% uptime for Office 365 applications across different regions over a 12 month period. It describes redundancy measures, resiliency practices, and monitoring used to maintain service levels. These include physical and data redundancy, load balancing, automated recovery, and detailed logging. Additional details are provided on incident response and reporting through the Service Health Dashboard.
5. [Charts: monthly uptime from JAN through the following JAN, on an axis from 99.80% to 100.00%, for the Americas Region, European Region, and Asia-Pacific Region]
In a 12 month period, the uptime of O365 applications averaged > 99.9%
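To put the 99.9% figure in perspective, the arithmetic below converts an availability percentage into the amount of downtime it allows. This is a minimal sketch: the function name `allowed_downtime_minutes` and the 30-day and 365-day periods are illustrative assumptions, not values taken from the slides.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime consistent with an availability level.

    Hypothetical helper for illustration; the periods used below are assumed.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Downtime allowed at 99.9% availability over an assumed 30-day month
monthly_budget = allowed_downtime_minutes(99.9, 30)
# Downtime allowed at 99.9% availability over an assumed 365-day year
yearly_budget = allowed_downtime_minutes(99.9, 365)

print(f"30 days:  {monthly_budget:.1f} minutes")
print(f"365 days: {yearly_budget:.1f} minutes")
```

Under these assumptions, 99.9% corresponds to roughly 43 minutes of downtime in a 30-day month, or just under 9 hours in a year.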
6. Redundancy
Physical redundancy
Data redundancy
Functional redundancy
Resiliency
Active load balancing
Recovery across “failure domains” regularly tested
Human backup
Automated recovery alerts
24x7 on-call engineer
On-call engineers are core product group members
Distributed Workloads
Distributed components are more resilient
Most failures are contained to a single service.
Service component isolation
Complexity avoidance and graceful degradation
Standardized hardware
Fully automated deployment
Built-in workload management mechanisms
Inspectability and predictability
Detailed log and tracing
Deep internal monitoring
augmented by extensive
outside-in monitoring
diagnostics
8. Incident Status
Status: Description (each status has its own SHD icon)
Investigating: Monitors have indicated a service anomaly and/or Microsoft has received reports of a potential service incident. Microsoft is currently investigating.
Service Interruption: Microsoft has confirmed that normal services are being impacted. Microsoft is taking immediate action to understand the cause of the failure and determine the best course of action to restore service.
Service Degradation: Services are still active, but service responsiveness and/or delivery times may be slower than usual. Microsoft is working to restore normal service responsiveness.
Restoring Service: Microsoft has isolated the likely cause of the incident and is in the process of restoring service.
Extended Recovery: Services are restored and may be slower than usual.
Service Restored: Normal system services have been restored.
False Positive: The service is healthy and a service incident did not actually occur.
Additional Information: There is additional information provided.
Normal Service: The service is healthy.
18. Post-incident reviews (PIRs) are published for service availability issues that span multiple customers
Available within 5 business days
PIR downloadable document accessible from SHD
A PIR includes:
• Incident Information
• Summary
• Customer Impact
• Incident Start Date and Time
• Root Cause
• Next Steps
30 day historical view in SHD
21. Type: Planned Maintenance Update
Description:
• Notification 5 business days before planned service maintenance.
• Notification includes start and end time.
Channel:
• Service Health Dashboard
• RSS Admin Feed (for subscribed admins)
22. Transparent non-customer impacting service hygiene
More detailed information and programmatic approach around
service updates and service incidents
Tenant Level Reporting
Service Health Dashboard Customer Preview Programs
Service Communication Panel Concept
45. Service health summary with quick
access to detailed dashboard
Simplified navigation bar with
quick access to all workloads
Reports on service usage
and performance
46. Manage mailboxes, groups and objects
Search for properties
Conduct an advanced search
Manage roles and permissions
Create policies
Track message delivery
51. The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
Longer outages have a greater impact on the percentage.
Outages that affect a greater number of users have a greater impact.
More severe outages, in terms of users or duration, lead to greater deviations from 100%, which can be used to determine service credits as a remedy.
The Office 365 service level
agreement expresses uptime
in this way:
\[
\frac{\text{User minutes} - \text{downtime}}{\text{User minutes}} \times 100\%
\]
The aggregate uptime of service
components can be expressed
similarly.
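The uptime expression above can be sketched in code. This is an illustration under stated assumptions, not the SLA's definitive calculation: it assumes user minutes are the number of users multiplied by the minutes in the period, and that downtime is the sum over outages of affected users multiplied by outage duration. The names `Outage` and `uptime_percentage` and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    """Hypothetical record of one outage (illustrative, not from the SLA)."""
    affected_users: int
    duration_minutes: float


def uptime_percentage(total_users: int, period_minutes: float,
                      outages: Iterable[Outage]) -> float:
    """Compute (user minutes - downtime) / user minutes * 100.

    Assumption: downtime is weighted by affected users and duration, so
    longer outages and outages hitting more users lower the result more.
    """
    user_minutes = total_users * period_minutes
    if user_minutes <= 0:
        raise ValueError("user minutes must be positive")
    downtime = sum(o.affected_users * o.duration_minutes for o in outages)
    return (user_minutes - downtime) / user_minutes * 100


# Assumed example: 1,000 users over a 30-day period with two outages
period = 30 * 24 * 60
sample_outages = [
    Outage(affected_users=1000, duration_minutes=30),   # short, all users
    Outage(affected_users=100, duration_minutes=120),   # long, 10% of users
]
result = uptime_percentage(1000, period, sample_outages)
print(f"Uptime: {result:.3f}%")
```

Weighting by users and duration mirrors the two points made on the slide: both a longer outage and a wider outage move the percentage further from 100%.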
52. Hardware or software failures
Monitoring alerts
Service incidents
Customer reported incidents
56. Executed from two+ locations to
ensure accuracy and redundancy
Simulates full end user and system
transactions
Supports every major system and user
scenario
Failures at any point are turned into
alerts and escalated to engineers
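The outside-in monitoring described above can be sketched as a probe runner that executes the same scenario from several locations and turns any failure into an alert. This is a minimal sketch with hypothetical names (`Alert`, `run_probes`, the location labels); the probes are plain callables standing in for real end-user transactions, and the escalation step is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    """Hypothetical alert record that would be escalated to an engineer."""
    scenario: str
    location: str
    detail: str


def run_probes(scenario: str,
               probes: Dict[str, Callable[[], None]]) -> List[Alert]:
    """Run one scenario from every location and collect an alert per failure.

    Each probe is a callable that raises an exception if the simulated
    transaction fails. At least two locations are required.
    """
    if len(probes) < 2:
        raise ValueError("run each scenario from at least two locations")
    alerts: List[Alert] = []
    for location, probe in probes.items():
        try:
            probe()
        except Exception as exc:  # any failure at any point becomes an alert
            alerts.append(Alert(scenario, location, str(exc)))
    return alerts


def healthy_probe() -> None:
    """Stand-in for a transaction that completes successfully."""


def failing_probe() -> None:
    """Stand-in for a transaction that fails partway through."""
    raise TimeoutError("mailbox login timed out")


alerts = run_probes("send and receive mail", {
    "location-a": healthy_probe,
    "location-b": failing_probe,
})
for alert in alerts:
    print(f"ALERT [{alert.location}] {alert.scenario}: {alert.detail}")
```

Running the probe from more than one location helps separate a genuine service failure from a problem local to a single probing site.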