QUIC.cloud Service Outage 2023-09-07

Summary

An unexpected issue occurred today during regular maintenance, causing widespread downtime for all domains using QUIC.cloud CDN. The downtime was caused by a temporary loss of data, which had to be repopulated manually. The repopulation was completed within one hour and forty-five minutes, by which time all domains were fully operational again.

Details

We deployed a routine maintenance run on the fleet. Due to a degraded replica, our database masters lost connection to each other. The database logic that handles this situation includes discarding the old master data, and the net result was that the DNS data was wiped from Redis.

To resolve this, we cleared the sync state in our data conversion tool to force it to repopulate the entire database. This initial fix brought most domains back online, but left some CNAMEs still broken.

The CNAME issue was caused by data entry errors on two domains, specifically an extra space in the IP value. When we ran a data verification, these domains failed the reload. We deleted the records responsible, removed the spaces, and repopulated again to bring the remaining domains online.
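
To illustrate how a stray space can make a record fail a strict load, here is a minimal Python sketch. It is not our production code: the function names and the `records` mapping of domain to IP value are hypothetical, and it assumes the value is meant to be a plain IPv4 or IPv6 address. It relies on the standard library's `ipaddress` module, which rejects values with leading or trailing whitespace.

```python
import ipaddress


def normalize_ip_value(raw: str) -> str:
    """Strip surrounding whitespace and return the canonical IP string.

    Raises ValueError if the cleaned value is still not a valid address.
    (Hypothetical helper, shown only to illustrate save-time cleanup.)
    """
    cleaned = raw.strip()
    return str(ipaddress.ip_address(cleaned))


def find_invalid_records(records: dict[str, str]) -> list[str]:
    """Return the domains whose stored IP value fails strict parsing.

    No stripping is done here, so a value such as "192.0.2.11 " is
    reported as invalid, much like a record failing a reload.
    """
    bad = []
    for domain, value in records.items():
        try:
            ipaddress.ip_address(value)
        except ValueError:
            bad.append(domain)
    return bad
```

Running something like `find_invalid_records` before a full repopulation would flag the affected domains up front, while `normalize_ip_value` shows the kind of cleanup that could be applied when a record is saved.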

Timeline

  • Outage first reported: 9:13 UTC
  • Majority of domains brought back online: 10:07 UTC
  • Remaining domains operational: 10:55 UTC

Future Plans

We have learned from this incident, and intend to implement the following changes:

  • We will revamp the maintenance procedure for this type of server. This means adding more alerts to detect problems early and implementing more disaster recovery tests.
  • We have already patched our CNAME code so that records can no longer be saved to the database with extra spaces.

We appreciate your trust and your loyalty, and we wholeheartedly apologize for any inconvenience this outage may have caused you. Please do not hesitate to contact support if you are experiencing any residual issues.
