Technical Issue with dashboard availability

Incident Report for Sumsub

Postmortem

Around 15:50 UTC we received the first alerts about increased system load on our platform. Our automatic scaling system attempted to mitigate the problem by increasing the number of backend instances. During this time a significant number of requests were still going through, but the system was extremely slow to process them. This prompted our Engineering team to open an incident report and begin a full-scale investigation. As a result, we found that I/O was the root cause.
It is important to clarify that our backend relies heavily on a distributed file system provided by AWS. We opened a case with AWS while we worked around the clock on a plan to make our system responsive again, not yet knowing that the root cause had originated on Amazon's side.
Here are some of the actions taken and the key events:
1. We replaced the file system with another one with more aggressive settings. That action showed some improvement, but unfortunately did not give us the expected results. This forced us to make changes on the backend to prevent any performance degradation while working without the distributed file system at all.
2. Around 21:25 UTC - A fix was rolled out, and we confirmed that the changes made to the system were performing as expected.
3. Around 00:15 UTC - AWS acknowledged that there were elevated latencies for the file system and started an investigation on their side.
4. Around 02:00 UTC - AWS identified the issue’s root cause and confirmed they were working on a fix. There have been no further updates from AWS on the case yet, and the incident is not yet reflected on the global AWS status page.

Posted Feb 03, 2023 - 18:14 UTC

Resolved

This incident has been resolved.

Posted Feb 02, 2023 - 22:21 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 02, 2023 - 19:53 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 02, 2023 - 19:40 UTC

Update

We are continuing to investigate this issue.

Posted Feb 02, 2023 - 18:40 UTC

Update

The service has encountered a technical problem affecting access for the majority of clients. No images can be uploaded for applicants.
Our team is investigating the incident.

Posted Feb 02, 2023 - 18:18 UTC

Investigating

We are currently investigating this issue.

Posted Feb 02, 2023 - 18:16 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 02, 2023 - 16:30 UTC

Investigating

The service has encountered a technical problem affecting access for the majority of clients. No images can be uploaded for applicants.
Our team is investigating the incident.

Posted Feb 02, 2023 - 16:30 UTC

This incident affected: API, WebSDK, and MobileSDK.