
NodeCraft.com 06/24/16 Outage Postmortem

As of 2:35 AM CST on 06/24/2016, NodeCraft.com, its APIs, and several automated systems we rely on for critical system stability suffered an outage. The primary cause has been identified as a runtime crash in our database engine. Our API and automated systems were unavailable for 34 minutes before being restored by our system recovery automation; the database, however, failed to recover automatically and continued to prevent users from using the control panel or website. At 7:06 AM CST our staff recovered the database, which restored system functionality. Our team then pushed a previously pending update out to all host nodes before performing a system-wide reboot. As of 9:00 AM, all systems had been completely restored to full functionality.

Results of Investigation

After restoring system functionality and a quick run to grab caffeine and sugary snacks, we performed a several-hour investigation aimed at identifying the initial cause of the outage and its side effects, and at building a prevention plan to ensure it does not occur again. As described above, we identified that our database engine failed. Our team is working to deploy a clustered database runtime that will scale as our system requires additional resources and that is more fault tolerant to similar outages. Deploying the cluster will require a small outage of 5-10 minutes in the near future, and we will post additional updates as we near the deployment date.
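To illustrate what fault tolerance buys us, here is a minimal sketch of how an application can keep working when one node of a database cluster goes down, assuming each node is reachable over TCP. The host names, port, and connect_to_cluster helper are hypothetical examples for illustration only, not our actual configuration or database engine.

```python
# Illustrative sketch only: fail over between hypothetical database cluster nodes.
import socket

# Hypothetical node addresses; a real deployment would load these from configuration.
DB_NODES = ["db-1.internal:4000", "db-2.internal:4000", "db-3.internal:4000"]


def connect_to_cluster(nodes, timeout=2.0):
    """Try each node in order and return the first connection that succeeds."""
    last_error = None
    for node in nodes:
        host, port = node.rsplit(":", 1)
        try:
            return socket.create_connection((host, int(port)), timeout=timeout)
        except OSError as err:
            last_error = err  # this node is down or unreachable; try the next one
    raise ConnectionError(f"No database nodes reachable; last error: {last_error}")
```

With a single database server, one crash takes everything down; with a cluster, the remaining nodes can keep serving the panel and API while the failed node recovers.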

Side Effect: Loss of Offsite Backups

Further investigation of the incident revealed that an edge-case bug was triggered in our automation. One script, which scans our offsite backups for orphaned archives (backup files that have no matching database entries, or backups for services that are no longer active), contained a significant bug that was triggered when the database became unavailable. This script scans the entire list of backups stored in Amazon S3 and compares it to the database. In a race condition, the script ran during the initial phases of the database outage and built a list in which the significant majority of offsite backups had no matching database entry, or at least none the script could locate, as a result of the outage. This resulted in the mass deletion of nearly 57% of all customer services' backups from offsite storage.

When our automation ran again (it runs on a 3-hour loop), it was able to connect to the database, but it then found that the backups previously deleted from Amazon S3 were missing their files and deleted the corresponding entries from the database as well. Because of the way the backups were flagged for deletion, the glitch ignored locked backups and indiscriminately deleted everything it had flagged. We have since disabled this script until we can implement a full patch that adds sanity checks, flags files for deletion before removing them, and automatically reports to our team in the event something similar should occur again. We will keep our customers updated as we are ready to deploy this update.
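As a rough illustration of the kind of sanity check we have in mind, the sketch below refuses to flag anything for deletion when the database returns no entries, or when an implausibly large share of backups suddenly looks orphaned. The threshold and function name are hypothetical; this is not the actual patched script.

```python
# Illustrative sketch of a sanity-checked orphan scan; not our production script.
MAX_ORPHAN_RATIO = 0.05  # refuse to act if more than 5% of backups look orphaned


def find_orphans_to_flag(s3_keys, db_keys):
    """Return backup keys that are safe to flag for deletion (not delete outright)."""
    if not db_keys:
        # An empty result almost certainly means the database was unreachable or
        # mid-recovery, not that every single backup is orphaned.
        raise RuntimeError("No database entries returned; refusing to flag anything")

    orphans = set(s3_keys) - set(db_keys)
    if len(orphans) / max(len(s3_keys), 1) > MAX_ORPHAN_RATIO:
        # Implausibly many "orphans" is itself a red flag: stop and page a human.
        raise RuntimeError(
            f"{len(orphans)} of {len(s3_keys)} backups look orphaned; "
            "aborting and alerting staff instead of deleting"
        )
    return orphans
```

Flagged files would then sit in a pending-delete state, and be reported to our team, before anything is actually removed from Amazon S3 or the database.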

We are prepared to issue refunds or credit to our customers who lost backups or suffered from service outages during this incident, and all customers affected have been notified via email. Please submit a support ticket via our Support Center and our billing team will work directly with you to ensure you are properly compensated.

Response Time

Our last outage was caused by our API servers crashing without a recovery system in place, and it required intervention by systems personnel. In this instance, too, the problem could not be resolved automatically and directly required systems personnel. As both of these incidents have made clear, we had failed to implement a working pager system to contact or wake the relevant staff. While we had many monitoring systems in place, none of them ensured that someone had actually acknowledged the issue as pending; roughly 4 hours is not an acceptable response time.

Today we set up a third-party system to alert "on call" staff members to outages so that they can be identified and managed within minutes. If a single individual is unable to respond, the alert escalates across our team and will eventually reach everyone if needed. We pledge to avoid extended outages similar to this one; we can definitely do better.
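The behaviour we want from the new alerting is roughly the loop sketched below: page the first on-call person, wait a bounded time for an acknowledgement, then escalate to the next person until someone responds. The rotation, timings, and the acknowledged callback are made up for illustration and are not the third-party service's actual API.

```python
# Illustrative escalation loop; real alerting is handled by a third-party service.
import time

ON_CALL_ROTATION = ["first-responder", "backup-responder", "team-lead"]  # hypothetical
ACK_TIMEOUT_SECONDS = 5 * 60  # escalate if nobody acknowledges within 5 minutes


def page_until_acknowledged(incident, acknowledged):
    """Page each person in order until someone acknowledges; return who answered."""
    for person in ON_CALL_ROTATION:
        print(f"Paging {person}: {incident}")
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            if acknowledged(person):
                return person
            time.sleep(30)  # poll for an acknowledgement before escalating
    return None  # the whole team has been paged and nobody answered
```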

Conclusion

Transparency is one of our most important core values as a service provider. It's important that we keep our customer base up to date on problems and changes to our service so they can make informed decisions. The reality is that outages like this one will eventually happen on any platform or product; it is our responsibility to put plans and protocols in place to minimize the effect they have on customers. We are confident that the changes we've made today will create a more stable product and reduce the downtime of any future outages. If you have any further questions regarding this incident or your backups, please contact our team and we'll be happy to help where we can.

We truly apologise for the inconvenience this downtime caused many of our customers. We hope that you accept our humblest apologies, and that you will continue to support us and our service as diligently as you have in the past.
