Cloud-based software system outage postmortem

Summary:

From 2:00 PM to 4:00 PM UTC, requests to our cloud-based software system returned HTTP 500 error responses, interrupting service for two hours for 70% of our clients. The outage was caused by a database query error during maintenance.

Timeline:

  • 2:00 PM: Scheduled maintenance began to update the system's database schema. It was expected to take 30 minutes.

  • 2:30 PM: The maintenance completed and the system appeared to be running normally.

  • 3:00 PM: Datadog alerted the team to the outage.

  • 3:15 PM: The root cause was found to be a query error.

  • 3:30 PM: An attempt to restore the database failed because the backup was corrupted.

  • 3:40 PM: The database was successfully restored from an earlier backup.

  • 3:42 PM: The server was restarted.

  • 4:00 PM: The system was fully operational again.

Root cause:

The root cause of the outage was identified as a database query error caused by the maintenance task at 2:00 PM UTC. This error caused the database to lock up, preventing any new connections from being made.
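
One common guard against this failure mode is to run schema changes inside a single transaction with a bounded lock wait, so that a migration that fails or cannot acquire its locks is rolled back instead of leaving the database blocked. The following is a minimal sketch, not the procedure our team used: it relies on Python's built-in sqlite3 module as a stand-in because this postmortem does not name the production database, and the helper name `run_migration`, its parameters, and the timeout value are illustrative assumptions.

```python
import sqlite3


def run_migration(db_path, statements, busy_timeout_s=5.0):
    """Apply schema statements atomically; roll back all of them on any error.

    Hypothetical helper: sqlite3 stands in for the real database here.
    Returns True if every statement was committed, False otherwise.
    """
    # timeout bounds how long we wait on a locked database before raising.
    # isolation_level=None turns off implicit transactions so that
    # BEGIN / COMMIT / ROLLBACK are issued explicitly below.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_s, isolation_level=None)
    try:
        # Take the write lock up front, or fail once the timeout expires.
        conn.execute("BEGIN IMMEDIATE")
        for stmt in statements:
            conn.execute(stmt)
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        # Undo any statements that already ran in this transaction.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        return False
    finally:
        conn.close()
```

With this shape, a migration whose second statement fails leaves the schema exactly as it was before the first statement ran. Other databases offer comparable controls, but their settings and locking behavior differ and would need to be checked against their own documentation.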

Resolution:

Our team restored the database from an earlier backup, but this resulted in the loss of approximately 2 hours of data. We have since reviewed our backup processes and added safeguards against data loss, as described under the corrective and preventative measures below.
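
The first restore attempt failed on a corrupted backup, so one safeguard worth sketching is an integrity check that runs before a backup is chosen for restore. The example below is an assumption-laden illustration rather than our actual tooling: it supposes that a SHA-256 checksum is recorded when each backup is written, and the function names `sha256_of` and `pick_restorable_backup` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_restorable_backup(backups):
    """Return the first backup whose checksum still matches, else None.

    backups: list of (path, expected_sha256) pairs, ordered newest first.
    The expected checksum is assumed to have been recorded at backup time.
    """
    for path, expected in backups:
        if Path(path).is_file() and sha256_of(path) == expected:
            return path
    return None
```

A matching checksum only shows that the file has not changed since it was written. It does not prove that the dump itself is restorable, so a periodic test restore remains the stronger check.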

Corrective and preventative measures:

In the past few days, we have conducted a full review of our backup processes and implemented additional safeguards to prevent similar incidents. The following items are still to be done:

  • Improve our monitoring and alerting systems so that we detect similar issues sooner and respond faster.

  • Make our incident response plan more detailed for database issues.
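
To make the monitoring item concrete, the sketch below shows the kind of rule that could flag a spike in 500 responses within minutes: it tracks the share of 5xx responses over a sliding time window and signals once that share crosses a threshold. This is only an illustration of the logic and not Datadog's API. In practice the rule would be configured in the monitoring tool itself, and the class name `ErrorRateMonitor` and all default values here are assumptions.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window 5xx error-rate check (illustrative, hypothetical names)."""

    def __init__(self, window_s=300, threshold=0.05, min_requests=20):
        self.window_s = window_s          # how far back to look, in seconds
        self.threshold = threshold        # alert when the error share reaches this
        self.min_requests = min_requests  # ignore windows with too little traffic
        self._events = deque()            # (timestamp, is_error), oldest first

    def record(self, timestamp, status_code):
        """Record one response; timestamps must be non-decreasing."""
        self._events.append((timestamp, status_code >= 500))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def should_alert(self, now):
        """Return True if the windowed 5xx share is at or above the threshold."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total >= self.threshold
```

The `min_requests` floor is a design choice that keeps a single failed request during a quiet period from paging the team.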

Lessons learned:

  • We need to improve our backup processes to ensure that we have multiple, reliable backups that can be restored quickly in the event of an outage.

  • Our monitoring and alerting systems need to be improved to detect issues earlier and notify our team before clients are affected.

  • We need to review and update our incident response plan to include more detailed steps for handling database-related issues.

  • Communication with clients needs to be improved, with more timely updates provided during an outage.