Elevated Outage Adra Balancer
Incident Report for Adra Suite
Postmortem

Summary:

Due to a faulty Balancer web service instance, several of our customers had problems logging in. After this issue was mitigated, the service did not have enough capacity to handle the sudden influx of traffic, which caused performance issues for many of our customers.

We apologize for the inconvenience caused.

Description of issue:

At 07:44 CET, our automatic scaling mechanism provisioned an instance for the Balancer service that was missing a component, causing most calls routed to this instance to fail with a previously unseen error. An alert was sent to our SaaS Operations team because of an increase in service response time, but the root cause was missed because a service test and a review of the telemetry did not uncover the issue.

Further reports of customers being unable to log in reached SaaS Operations at approx. 09:00 CET. After the issue was verified, the offending instance was removed and service accessibility was restored.

Once the service was restored, the many users who were now able to log in caused a very large increase in server load, which until then had been unnaturally low, and the autoscaler was not equipped to handle the sudden increase. A manual override of the autoscaling was put in place, and service was restored at 09:40 CET.

Follow-up actions taken:

  • Improved automatic scaling to better accommodate sharp increases in load on the service:

When the load on the system rises sharply, we are now better equipped to increase our available resources quickly enough to meet the increased demand.
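
As an illustration only, the idea of reacting to a sharp rise in load can be sketched as below. This is not the actual Balancer autoscaler: the function name, the parameters and the default thresholds (surge_ratio, surge_headroom, max_instances) are all hypothetical and chosen for the example.

```python
import math


def desired_instances(current_load, previous_load, target_load_per_instance,
                      surge_ratio=2.0, surge_headroom=1.5, max_instances=50):
    """Return how many instances to run for the observed load.

    All names and defaults here are hypothetical; they only illustrate
    adding extra headroom when load jumps sharply between two samples.
    """
    # Baseline: enough instances to keep each one at or below its target load.
    baseline = math.ceil(current_load / target_load_per_instance)

    # A surge is a jump of at least surge_ratio times the previous sample.
    surging = current_load > 0 and current_load >= surge_ratio * previous_load
    if surging:
        # Provision beyond the baseline so capacity is ahead of demand.
        baseline = math.ceil(baseline * surge_headroom)

    # Always keep at least one instance and never exceed the configured cap.
    return max(1, min(max_instances, baseline))
```

In this sketch, a jump from 300 to 1000 requests per second at a target of 100 per instance yields 15 instances instead of the baseline 10, while a gradual rise from 900 to 1000 yields 10.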

  • Improved monitoring to increase sensitivity to novel service exceptions and to raise the severity of the resulting alerts:

When new exceptions arise in any of our services, our Operations team will be alerted with a higher-severity alert to ensure that we investigate, document and categorize any unforeseen service behavior in a timely fashion.
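
A minimal sketch of how such an escalation rule could look is given below. It is an assumption made for illustration, not a description of our monitoring stack: the class ExceptionTriage, its methods, the severity labels and the exception names in the test data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExceptionTriage:
    """Hypothetical triage helper: unseen exceptions get a higher severity."""

    # Set of (service, exception_type) pairs that have already been categorized.
    known: set = field(default_factory=set)

    def severity_for(self, service, exception_type):
        # An exception that nobody has categorized yet is treated as novel
        # and keeps alerting at high severity until it is reviewed.
        if (service, exception_type) in self.known:
            return "normal"
        return "high"

    def categorize(self, service, exception_type):
        # Called once the exception has been investigated and documented.
        self.known.add((service, exception_type))
```

The design choice in this sketch is that a novel exception stays at high severity until someone explicitly categorizes it, rather than being downgraded automatically after its first occurrence.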

Posted Oct 27, 2023 - 08:24 UTC

Resolved
Between 08:30 and 09:00 CET, we experienced a disruption of service for Balancer customers hosted in the EU.

While the root cause is still being determined, the issue is fully resolved.
Posted Oct 03, 2023 - 06:30 UTC