Temporary verification disruptions
Incident Report for Sumsub
Postmortem

Incident Timings

Start time: 19 Jan 16:30 UTC
End time: 19 Jan 17:44 UTC

Incident Summary

As a result of an unprecedented surge in traffic, our team identified an issue affecting one of the services in our application infrastructure. This incident caused verification processing to pause for approximately 50 minutes. All verifications pending during this time were automatically queued to be processed later. After identifying and fixing the underlying issue, our team ensured all queued operations were successfully completed, with no impact on users and no additional action required from our clients.

Root Cause

The incident was primarily caused by temporary downtime in an isolated service cluster, compounded by failover complications. Under peak load, excessive resource usage led to node instability. The failover mechanism did not operate as intended, delaying the restoration of normal functionality.

Action Plan

Our team has planned several improvements to the affected components of our infrastructure:

  1. Service Optimization: making the service less resource-intensive, thereby reducing the probability of future outages
  2. Failover Enhancement: reconfiguring failover mechanisms to increase redundancy
  3. Alerting Optimization: improving alerting to ensure timely detection of potential issues

Conclusion

We sincerely apologize for any inconvenience this incident may have caused. Ensuring the reliability and stability of our systems is our highest priority, and we are committed to learning from this event. The changes and improvements outlined above will strengthen our infrastructure and reduce the likelihood of similar incidents in the future.
Thank you for your understanding and continued trust in our team and product.
If you have any questions or concerns, please don’t hesitate to contact our Support team.

Posted Jan 21, 2025 - 17:07 UTC

Resolved
We would like to inform you that the issue is now fully resolved: verification times are back to normal, and all delayed verifications have been processed.

Our team has conducted an additional series of checks and tests to ensure the services remain stable.
We confirm the problem has been fully mitigated.

An unprecedented burst of traffic hit our systems within a very short timeframe, causing an extreme spike in load that led to the issue.

We have implemented measures to handle increased demand more effectively, and our team will continue monitoring to ensure stable service going forward.
Posted Jan 19, 2025 - 20:11 UTC
Monitoring
We are pleased to report that the internal service responsible for verification processing has been successfully restored.

Our team is now focused on addressing any remaining consequences of the incident, such as clearing backlogs and reconciling delayed verifications, while monitoring system stability.
Posted Jan 19, 2025 - 17:51 UTC
Investigating
One of our internal services, responsible for verifications, has experienced a failure. As a result, some users may encounter delayed verifications.

Verification processes may experience significant delays, which can affect user onboarding, transaction approvals, and other verification-dependent workflows.
The service downtime may lead to extended waiting periods for end users and for automated processes that require verification.

The engineering team is actively investigating the root cause of the service failure.
Temporary workarounds are being explored to reduce user impact while the service is being restored.
Posted Jan 19, 2025 - 17:25 UTC
This incident affected: API, WebSDK, and MobileSDK.