Between 08:00 and 12:51 BST on 05/07/2019 we identified an issue affecting access to all Kallidus Perform, Learn and Classic LMS services. The issue was caused by an outage in the storage infrastructure of our hosting provider, Microsoft Azure. The outage was not limited to Kallidus. As a result, various services used by Kallidus were unavailable for the duration of the incident. In the event of an Azure platform issue, we rely on Microsoft to undertake remedial action.
Microsoft Azure has provided a preliminary root cause, which is subject to change as they continue to investigate. If there are material changes to this text, we will send out an update. Their response is below:
Summary of impact: Between 06:00 UTC and 16:25 UTC on 04 July 2019, a subset of customers leveraging Storage in UK South may have experienced service availability issues. In addition, resources with dependencies on Storage, may also have experienced downstream impact in the form of availability issues.
Preliminary root cause: Engineers identified high levels of resource utilization on a single storage scale unit. As a result, services dependent on the storage scale unit experienced a high number of failures and latency which manifested in availability issues.
Mitigation: Engineers manually applied load balancing configuration changes to bring the affected storage scale unit back to a healthy state. As a consequence, resource utilization levels were brought back to normal mitigating the issue.
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences.
If you require any further information, please contact the Customer Support Team or your account manager.
Comments
Thanks @Arun - helpful summary. I take it the challenge with Microsoft is to reduce the need for manual load balancing optimisation by automating load balancing on failover?