Resolved -
We found a configuration setting that did not carry over from the old version of the database to the new one. We added this setting on Friday at around 6pm Pacific, which fixed the problem with big data operations on our Subscription service database for all clients.
We monitored over the weekend, and there were no further errors. This issue has been resolved.
We have taken steps to make sure all of our databases have this configuration, and we will be monitoring for similar issues on other databases going forward.
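As a sketch of the kind of audit we now run across instances: the exact setting involved is not named in this report, so the variables below are standard MySQL 8.0 temporary-table settings shown purely for illustration.

```sql
-- Illustrative only: compare temporary-table-related server variables
-- across instances after an upgrade (MySQL 8.0). The specific setting
-- involved in this incident is not named above.
SHOW GLOBAL VARIABLES LIKE 'tmp_table_size';
SHOW GLOBAL VARIABLES LIKE 'max_heap_table_size';
SHOW GLOBAL VARIABLES LIKE 'temptable_max_ram';
SHOW GLOBAL VARIABLES LIKE 'tmpdir';
```

Running the same check against each upgraded instance and diffing the results is one way to catch settings that silently reset to defaults during a major-version upgrade.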
Apr 22, 11:25 PDT
Identified -
The issue has been identified and a fix is being implemented.
Apr 19, 15:32 PDT
Update -
We are continuing to investigate. The issue seems to have started on April 14th, when we applied an AWS-required MySQL 5.7 => 8 upgrade to our Subscription service database. This has apparently caused unforeseen performance issues when running multiple sites' machine learning jobs.
We will be upgrading the database instance size to sidestep the space issue temporarily. Our hypothesis is that this will buy us some time and (hopefully) allow our big jobs to keep running.
Meanwhile, we will investigate how to make disk usage more efficient, or how to resolve the issue outright.
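For context on the disk-usage investigation, a sketch of the standard MySQL 8.0 status counters that show whether queries are spilling temporary tables to disk (shown for illustration; not a record of the exact commands we ran):

```sql
-- Standard MySQL 8.0 status counters. A high or rising
-- Created_tmp_disk_tables relative to Created_tmp_tables suggests
-- large queries are spilling temporary tables to disk.
SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
```

Tracking the ratio of these two counters over time is a common way to confirm whether a workload's temporary tables fit in memory or are landing on disk.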
Apr 19, 15:32 PDT
Investigating -
We are currently investigating, but it seems one of our databases has run out of temporary disk space for unloading large tables to our machine learning algorithm. Not all sites are affected; it seems to primarily be an issue for larger sites (many millions of users).
We are working to remediate this, but we are also trying to find out why this suddenly started happening even though we haven't changed much on the database server side. We will post further findings here as we have them.
Apr 19, 14:20 PDT