On Thursday evening, Microsoft identified a global outage affecting users of various Microsoft 365 applications and services. Impacted users experienced login failures, unavailable Azure-hosted virtual machines, and endless loading screens in Microsoft 365 services, among other issues.
To make matters worse, at around the same time CrowdStrike released an update that began blue-screening devices. Microsoft stated that the outage was unrelated to the CrowdStrike blue screens. As of this posting, Microsoft has resolved the issue and continues to monitor it under MO821132.
Affected Services
– Microsoft Defender
– Microsoft Defender for Endpoint
– Microsoft Defender Experts
– Microsoft Intune
– Microsoft OneNote
– OneDrive for Business
– SharePoint Online
– Windows 365
– Viva Engage
– Microsoft Purview
– Microsoft Fabric
– Power BI
– Microsoft Teams
– Microsoft 365 admin center
We’re investigating an issue impacting users ability to access various Microsoft 365 apps and services. More info posted in the admin center under MO821132 and on https://msft.it/6019lRURc
— Microsoft 365 Status (@MSFT365Status) July 18th, 2024
Microsoft acknowledged the issue at 7:13 PM EDT, stating that it would provide an update within 30 minutes. That was over an hour after the Exoprise CloudReady sensors first detected the outage, at around 6:00 PM EDT. Customers were able to communicate the outage to their users early, which prevented a flood of tickets.
Microsoft Ticket MO821132
The Microsoft ticket is available here: Users may be unable to access various Microsoft 365 apps and Services
Incident Start Date and Time
Thursday, July 18, 2024, 7:13 PM EDT
Incident End Date and Time
Thursday, July 18, 2024, 9:55 PM EDT
Scope of impact
This issue may have impacted any user attempting to use various Microsoft 365 apps and services.
Exoprise Analysis
Starting around 6:00 PM EDT, CloudReady sensors began reporting errors across multiple Microsoft 365 services. The specific errors varied by service, but every affected service was detected during the outage.
At 7:36 PM EDT, Microsoft posted an update stating that it had begun investigating the cause of the issue and was rerouting affected traffic away from the impacted infrastructure. The failover helped stabilize usage while the root-cause investigation continued. Even with the failover in place, CloudReady sensors continued to report errors into the next day while Microsoft worked on a resolution.
On July 19th, 2024, at 11:29 AM EDT, Microsoft updated MO821132 to state that the issue had been resolved and posted the following to Twitter:
After an extended period of monitoring, we’ve determined that the issue is mitigated, and all previously impacted Microsoft 365 apps and service have recovered. For more information, see MO821132 within the admin center.
— Microsoft 365 Status (@MSFT365Status) July 19th, 2024
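The timing gap described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are our own, the times are the ones reported in this post, the 6:00 PM detection time is approximate, and all times are treated as Eastern Daylight Time (UTC-4), which is what applies in mid-July.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset for Eastern Daylight Time (assumption: all times are EDT).
EDT = timezone(timedelta(hours=-4), name="EDT")

# Timestamps as reported in this post; the first one is approximate.
first_sensor_errors = datetime(2024, 7, 18, 18, 0, tzinfo=EDT)
microsoft_acknowledged = datetime(2024, 7, 18, 19, 13, tzinfo=EDT)
declared_resolved = datetime(2024, 7, 19, 11, 29, tzinfo=EDT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# How long sensors were erroring before Microsoft's acknowledgement.
detection_lead = minutes_between(first_sensor_errors, microsoft_acknowledged)
# How long it took from acknowledgement until the incident was declared resolved.
ack_to_resolved = minutes_between(microsoft_acknowledged, declared_resolved)

print(f"Detection lead over acknowledgement: {detection_lead} min")
print(f"Acknowledgement to resolution: {ack_to_resolved // 60} h {ack_to_resolved % 60} min")
```

With these inputs, the lead time works out to 73 minutes, and the span from acknowledgement to declared resolution is 16 hours 16 minutes.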
Root Cause Analysis
During the outage, the CloudReady sensors detected errors for the impacted services. Because multiple services were affected, the errors varied by sensor type. The screenshotted errors, from a Teams sensor, show that the failure was not consistent: each failed run errored at a different step in the sensor's execution.
The same was true of all the affected sensors. Although the errors were spread out over time, the impact on the user experience was noticeable.
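To illustrate what "erroring at different steps" means, here is a minimal sketch of a generic multi-step synthetic check that records the first step to fail. It is not how CloudReady is implemented; the step names (`sign_in`, `load_client`, `send_message`) and the helper `first_failed_step` are hypothetical, and the failures are simulated.

```python
from typing import Callable, Optional

# Hypothetical step type: a name plus a callable that raises on failure.
Step = tuple[str, Callable[[], None]]


def first_failed_step(steps: list[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first that raises, else None."""
    for name, action in steps:
        try:
            action()
        except Exception:
            return name
    return None


def ok() -> None:
    """Stand-in for a step that succeeds."""


def fail() -> None:
    """Stand-in for a step that hits a connectivity error."""
    raise ConnectionError("simulated failure")


# Two simulated runs of the same three-step check, failing at different steps.
run_a: list[Step] = [("sign_in", ok), ("load_client", fail), ("send_message", ok)]
run_b: list[Step] = [("sign_in", fail), ("load_client", ok), ("send_message", ok)]

failures = [first_failed_step(run) for run in (run_a, run_b)]
print(failures)
```

Recording the failing step per run, rather than a single pass/fail flag, is what makes an intermittent backend problem distinguishable from a consistent break at one point in the workflow.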
Preliminary Root Cause From Microsoft
Preliminary root cause: A configuration change in a portion of our Azure backend workloads, caused interruption between storage and compute resources which resulted in connectivity failures that affected downstream Microsoft 365 services dependent on these connections.
Jul 19, 2024, 11:29 AM EDT
After an extended period of monitoring, in addition to internal validations and customer confirmations, we have declared the incident resolved. We understand the impact this incident has had on our users and greatly appreciate your organization’s patience and feedback while we worked to resolve this high priority issue in its entirety.
Jul 19, 2024, 10:20 AM EDT
We’ve completed applying our mitigation actions and our telemetry is now indicating all previously impacted Microsoft 365 apps or services are operating normally. We’re now entering a period of monitoring to ensure all previously impacted apps, services, and scenarios have fully recovered.
Jul 19, 2024, 9:48 AM EDT
We’re continuing to resolve the residual impact and we’re monitoring the Microsoft 365 apps and services while they fully recover. Customers should experience incremental recovery as we recover from the remaining impact. This quick update is designed to give the latest information on this issue.
Jul 19, 2024, 8:29 AM EDT
We’re continuing to apply mitigation actions to provide relief from the residual impact affecting the remaining impacted Microsoft 365 apps and services. Our telemetry is indicating that the remaining impacted scenarios are progressing towards a full recovery and we’re closely monitoring to ensure this progress continues.
Jul 19, 2024, 7:52 AM EDT
We’re continuing to apply additional mitigations to fix the residual impact affecting the remaining impacted Microsoft 365 apps and services. Our telemetry indicates that we’re progressing towards full recovery and we’re continuing to monitor. This quick update is designed to give the latest information on this issue
Jul 19, 2024, 6:24 AM EDT
The underlying cause of the issue has been fixed and several Microsoft 365 apps and services have been restored to full functionality. Residual impact is still affecting some Microsoft 365 apps and services, and Microsoft 365 engineering are continuing to conduct additional mitigation actions to provide relief. We’re continuing to observe an increase in functionality and availability for the remaining impacted scenarios and we’re monitoring this closely to ensure we’re progressing towards full recovery. Microsoft is continuing to treat this event with the highest possible priority.
Jul 19, 2024, 5:37 AM EDT
We’re conducting additional mitigation actions to remediate impact for the remaining affected Microsoft 365 apps and services. Our telemetry indicates functionality and availability are continuing to improve across multiple scenarios, and we’re monitoring this closely to ensure we’re progressing towards full recovery. This quick update is designed to give the latest information on this issue.
Jul 19, 2024, 4:28 AM EDT
We’re continuing to see an improvement in service availability across multiple Microsoft 365 apps and services. We’re closely monitoring our telemetry data to ensure this upward trend continues as our mitigation actions continue to progress.
Jul 18, 2024, 8:21 PM EDT
We remain focused on redirecting the impacted traffic to healthy systems as we investigate the root cause. Your organization may experience relief as our mitigation efforts progress. We understand the impact that this issue may have on your organization and we’re continuing to treat this event with the highest priority.
Jul 18, 2024, 7:58 PM EDT
We’re continuing to work on rerouting the impacted traffic to alternate systems to alleviate impact in a more expedient fashion. In parallel, our investigation into the underlying cause of the problem is ongoing. This quick update is designed to give the latest information on this issue.
Jul 18, 2024, 7:36 PM EDT
We’re rerouting affected traffic out of the impacted infrastructure while we continue to investigate the cause of the issue. We’re investigating reports of issues in which users are unable to access various Microsoft 365 apps and services. We’re looking into the cause of this incident and the best mitigation pathway forward. We will provide an update within 30 minutes.