Around 15:50 UTC we received the first alerts about increased system load on our platform. Our automatic scaling system attempted to mitigate the problem by adding backend instances. During this time a significant number of requests were still being served, but the system was performing every action with extreme delay. This prompted our Engineering team to open an incident report and begin a full-scale investigation. As a result, we found that IO was the root cause.
It is important to clarify that our backend relies heavily on a distributed file system provided by AWS. We opened a case with AWS and worked around the clock on a plan to make our system responsive again, not yet knowing that the root cause lay on Amazon's side.
Here are some of the actions we took:
1. We replaced the file system with another one using more aggressive settings. That showed improvement, but unfortunately did not give us the expected results. This forced us to make changes to the backend to prevent performance degradation while operating without the distributed file system at all.
2. Around 21:25 UTC - A fix was rolled out, and we confirmed the changes were delivering the expected performance.
3. Around 00:15 UTC - AWS acknowledged elevated latencies on the file system and started an investigation on their side.
4. Around 02:00 UTC - AWS identified the root cause of the issue and confirmed they were working on a fix. There have been no further updates from AWS on the case, and the incident is not yet reflected on the global AWS status page.
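The backend change in step 1 — staying responsive while the distributed file system is slow or unavailable — can be sketched roughly as a read path with a short timeout and a local fallback. This is a minimal illustration, not our actual implementation: the paths, timeout value, and the existence of a local cache copy are all assumptions for the example.

```python
import concurrent.futures
import os

# Small thread pool used only to bound how long we wait on shared-FS reads.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def _read_bytes(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def read_with_fallback(shared_path: str, local_path: str,
                       timeout: float = 0.5) -> bytes:
    """Try the shared (distributed) file system first, but give up
    quickly and serve a local copy if the read stalls or fails.

    `shared_path` / `local_path` / `timeout` are illustrative; a real
    system would tune these and record fallbacks in its metrics.
    """
    future = _pool.submit(_read_bytes, shared_path)
    try:
        return future.result(timeout=timeout)
    except (concurrent.futures.TimeoutError, OSError):
        # Shared FS is slow or unreachable: degrade to the local copy
        # instead of letting request latency climb unboundedly.
        return _read_bytes(local_path)
```

The key property is that a hung IO call on the shared mount can no longer block a request indefinitely; the worst case becomes `timeout` plus a local disk read.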