Office 365 service management

[Figure: monthly uptime from January through the following January, y-axis from 99.80% to 100.00%, one panel each for the Americas Region, European Region, and Asia-Pacific Region]
Over a 12-month period, the uptime of O365 applications averaged more than 99.9%
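To put the chart's axis values in perspective, the short sketch below converts an uptime percentage into the downtime it implies over a year. It is only illustrative arithmetic: the function name and the 365-day year are assumptions made here, not part of the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime implied by an uptime percentage over a period."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


# The three tick values shown on the chart's y-axis.
for target in (99.8, 99.9, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2f}% uptime -> {minutes:.1f} minutes "
          f"({minutes / 60:.2f} hours) of downtime per year")
```

At 99.9%, this works out to about 525.6 minutes, or roughly 8.76 hours, per year.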

Redundancy
Physical redundancy
Data redundancy
Functional redundancy
Resiliency
Active load balancing
Recovery across “failure domains” regularly tested
Human backup
Automated recovery alerts
24x7 on-call engineer
On-call engineers are core product group members
Distributed Workloads
Distributed components are more resilient
Most failures are contained to a single service.
Service component isolation
Complexity avoidance and graceful degradation
Standardized hardware
Fully automated deployment
Built-in workload management mechanisms
Inspectability and predictability
Detailed log and tracing
Deep internal monitoring augmented by extensive outside-in monitoring
diagnostics

Primary Channels
Additional Channels

Incident Status
Status: Description (each status has its own SHD icon)
Investigating: Monitors have indicated a service anomaly and/or Microsoft has received reports of a potential service incident. Microsoft is currently investigating.
Service Interruption: Microsoft has confirmed that normal services are being impacted. Microsoft is taking immediate action to understand the cause of the failure and determine the best course of action to restore service.
Service Degradation: Services are still active, but service responsiveness and/or delivery times may be slower than usual. Microsoft is working to restore normal service responsiveness.
Restoring Service: Microsoft has isolated the likely cause of the incident and is in the process of restoring service.
Extended Recovery: Services are restored and may be slower than usual.
Service Restored: Normal system services have been restored.
False Positive: The service is healthy and a service incident did not actually occur.
Additional Information: There is additional information provided.
Normal Service: The service is healthy.

Click on “View history for past 30 days”

Click on “Incident ID MO2708”

For Limited Set of Service Incidents
Explanation of Incident
Localized Content

Post-incident reports (PIRs) are published for service availability issues that span multiple customers
Available within 5 business days
The PIR is a downloadable document accessible from the SHD
A PIR includes:
• Incident Information
• Summary
• Customer Impact
• Incident Start Date and Time
• Root Cause
• Next Steps
30 day historical view in SHD

Click on “Post-incident report published”

Type: Planned Maintenance Update
Description:
• 5 business days prior notification of planned service maintenance.
• Notification includes start and end time.
Channel:
• Service Health Dashboard
• RSS Admin Feed (for subscribed admins)
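As a rough illustration of the 5-business-day notice window, the sketch below steps back from a maintenance start date to find the latest day on which a notification could go out. It assumes business days are Monday through Friday and ignores public holidays; the function name and the sample date are hypothetical.

```python
from datetime import date, timedelta


def subtract_business_days(start: date, business_days: int) -> date:
    """Step back the given number of weekdays (Mon-Fri) from start.

    Assumption: public holidays are not taken into account.
    """
    current = start
    remaining = business_days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


maintenance_start = date(2024, 3, 18)  # hypothetical maintenance date (a Monday)
latest_notice = subtract_business_days(maintenance_start, 5)
print(f"Maintenance on {maintenance_start}: notify no later than {latest_notice}")
```

For the Monday used here, the latest notice date is the Monday of the preceding week, because the intervening weekend does not count.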

Transparent non-customer impacting service hygiene
More detailed information and programmatic approach around service updates and service incidents
Tenant Level Reporting
Service Health Dashboard Customer Preview Programs
Service Communication Panel Concept

Office 365 Community
http://community.office365.com/en-us/preview/tools/troubleshooting.aspx
http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx#smallbusinesses
http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx#enterprises
https://outlook.com/owa
https://<domain>.sharepoint.com/<pagename>.aspx
https://<domain>.sharepoint.com/personal/<UserAlias>_<domain>/Documents/Forms/All.aspx

Web browser
Office client
Operating system

[Figure: support lifecycle timeline. Mainstream Support from 2013 to 2018, Extended Support from 2018 to 2023]

Service health summary with quick access to detailed dashboard
Simplified navigation bar with quick access to all workloads
Reports on service usage and performance

Manage mailboxes, groups and objects
Search for properties
Conduct an advanced search
Manage roles and permissions
Create policies
Track message delivery

Edit contact details
Manage groups
Manage voice mail and phone settings

The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
Longer outages have a greater impact on the percentage.
Outages that affect a greater number of users have a greater impact.
More severe outages, in terms of users or duration, lead to greater deviations from 100%, which can be used to determine service credits as a remedy.
The Office 365 service level agreement expresses uptime in this way:
$$\frac{\text{User minutes} - \text{Downtime}}{\text{User minutes}} \times 100\%$$
The aggregate uptime of service components can be expressed similarly.
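A small worked example may help show how the formula responds to outage length and user count. The sketch below assumes that user minutes are the number of users multiplied by the minutes in the period, and that downtime is counted in affected-user minutes; the exact definitions come from the SLA itself, so the names and figures here are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    duration_minutes: float  # how long the outage lasted
    affected_users: int      # how many users it affected


def uptime_percent(total_users: int, period_minutes: float,
                   outages: Iterable[Outage]) -> float:
    """(user minutes - downtime) / user minutes x 100, with downtime
    measured in affected-user minutes (an assumption of this sketch)."""
    user_minutes = total_users * period_minutes
    downtime = sum(o.duration_minutes * o.affected_users for o in outages)
    return (user_minutes - downtime) / user_minutes * 100.0


# Hypothetical tenant: 1,000 users over a 30-day month.
month_minutes = 30 * 24 * 60
outages = [
    Outage(duration_minutes=60, affected_users=1000),  # short, everyone affected
    Outage(duration_minutes=120, affected_users=100),  # longer, few users affected
]
print(f"Uptime: {uptime_percent(1000, month_minutes, outages):.3f}%")
```

In this example the one-hour outage that hits every user costs five times as many user minutes as the two-hour outage that hits a tenth of them, which matches the point that both duration and number of affected users drive the percentage.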

Hardware or software failures
Monitoring alerts
Service incidents
Customer reported incidents

SPO
EXO
Microsoft Online ID
Office 365 Portal
Office 365 Provisioning
Lync

Avoid unnecessary assumptions by on-call engineers
Isolate issues to root cause

Executed from two+ locations to ensure accuracy and redundancy
Simulates full end user and system transactions
Supports every major system and user scenario
Failures at any point are turned into alerts and escalated to engineers
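The outside-in monitoring points above can be sketched in a few lines. This is a simplified illustration, not Microsoft's monitoring system: the probe functions, location names, and the `Alert` type are all hypothetical, and real probes would run an actual end-user transaction rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    scenario: str
    location: str
    detail: str


def run_probes(scenario: str,
               probes: Dict[str, Callable[[], None]]) -> List[Alert]:
    """Run one synthetic transaction from every location.

    Any exception raised by a probe is turned into an alert.
    """
    if len(probes) < 2:
        raise ValueError("run each scenario from at least two locations")
    alerts = []
    for location, probe in probes.items():
        try:
            probe()
        except Exception as exc:
            alerts.append(Alert(scenario, location, str(exc)))
    return alerts


def healthy_transaction() -> None:
    pass  # stands in for a full sign-in and send-mail transaction


def failing_transaction() -> None:
    raise TimeoutError("sign-in step timed out")


alerts = run_probes("send mail", {
    "location-a": healthy_transaction,   # hypothetical probe location
    "location-b": failing_transaction,   # hypothetical probe location
})
for alert in alerts:
    # In a real system this is where escalation to an on-call engineer would happen.
    print(f"ALERT [{alert.scenario}] from {alert.location}: {alert.detail}")
```

Running the same scenario from more than one location makes it possible to tell a problem at a single probe site apart from a failure of the service itself.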
