Resend – Incident report for February 21st, 2024

Summary (TL;DR)

On February 21st, 2024, Resend experienced an outage that affected all customers, caused by a database migration that went wrong. This prevented customers from using the API (including sending emails) and from accessing the dashboard from 05:01 to 17:05 UTC (about 12 hours).

The database migration accidentally deleted data from production servers. We immediately began restoring from a backup, which completed 6 hours later. Unfortunately, once it finished, we found that it had not restored the data, so we had to start the restore process a second time with a different backup.

During this time, no API requests were being accepted and no data was being saved. For data created before the migration, there were 5 minutes of data loss between when the migration started and when the database went offline, from 04:50:00 to 04:56:27 UTC. We are currently working on re-populating the data from this 5-minute window.

We sincerely apologize for the impact and distress caused by this outage. We place great importance on reliability, but this week we fell short of our commitment to you all. It is clear that we have a long way to go in becoming a leading infrastructure provider, but by learning from this incident, we will improve our operations and tooling to avoid outages like this in the future, regardless of the cause.

Timeline

All times are in Coordinated Universal Time (UTC)

February 21st, 2024

  • 04:56: Database migration started
  • 04:57: Noticed tables being dropped from the production database
  • 05:01: Began restoring the database from a backup
  • 05:02: Posted on the status page, updating every 30-60 minutes until resolution
  • 11:02: First restore process completed
  • 11:03: Realized the first restore had failed and started to investigate
  • 11:33: Learned that the restore failed because the wrong backup timestamp had been selected
  • 11:48: Increased compute to speed up the restore process: database memory raised from 128GB to 256GB and CPU from 32-core ARM to 64-core ARM
  • 12:05: Began restoring the database from an older backup
  • 17:01: Second restore process completed
  • 17:02: API began receiving requests
  • 17:05: Dashboard became accessible again, and the incident was resolved

What happened

While building a feature, we ran a database migration command locally, but it incorrectly pointed to the production environment instead of the local one, which dropped all tables in production.

The first attempt to restore the database took 6 hours but failed because the wrong backup timestamp had been selected. The second attempt took an additional 5 hours and succeeded, bringing all data back except for a 5-minute window of data loss.
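The report does not say in what way the backup timestamp was wrong. As an illustration only, the sketch below assumes one plausible failure mode (a restore target at or after the destructive change) and checks the target before a multi-hour restore is started. The function name `validate_restore_target` and the 30-second safety margin are hypothetical, not part of Resend's tooling.

```python
from datetime import datetime, timedelta, timezone


def validate_restore_target(
    target: datetime,
    destructive_change_at: datetime,
    safety_margin: timedelta = timedelta(seconds=30),
) -> datetime:
    """Return target if it is safely before the destructive change, else raise."""
    # Naive datetimes are rejected so a local-time value is never mistaken for UTC.
    if target.tzinfo is None or destructive_change_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware; use UTC")
    # Hypothetical margin: stay a little before the change to avoid borderline targets.
    latest_safe = destructive_change_at - safety_margin
    if target > latest_safe:
        raise ValueError(
            f"restore target {target.isoformat()} is later than "
            f"{latest_safe.isoformat()} and may include the destructive change"
        )
    return target


# Migration start time taken from the timeline above (UTC).
migration_started = datetime(2024, 2, 21, 4, 56, tzinfo=timezone.utc)
safe_target = validate_restore_target(
    datetime(2024, 2, 21, 4, 55, tzinfo=timezone.utc), migration_started
)
```

A check like this costs seconds, while a restore with a bad target cost about 6 hours in this incident.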

Impact

All customers were unable to send emails, use the API, or access the Resend dashboard for 12 hours, from 05:01 to 17:05 UTC.

For data created before the migration, there were 5 minutes of data loss between when the migration started and when the database went offline, from 04:50:00 to 04:56:27 UTC.

Next steps and improvements

  • Re-populate data from the 5-minute window of data loss.

  • No accessible user role will have write privileges on the production database.

  • Improve local development to reduce risks related to database migrations.

  • Build redundancy to maintain sending functionality even during a database outage.

  • Increase the cadence of disaster recovery tests.

  • Implement an incident banner on the Resend dashboard to notify users immediately.
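One way the local-development item above could be approached is a guard that a migration script calls before doing anything destructive. This is a minimal sketch under stated assumptions, not Resend's actual fix: the function name `assert_local_database`, the `allow_remote` flag, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit

# Hosts treated as a developer's own machine (an assumption for this sketch).
LOCAL_HOSTS = frozenset({"localhost", "127.0.0.1", "::1"})


def assert_local_database(database_url: str, allow_remote: bool = False) -> str:
    """Return the database host, raising unless it is local or explicitly allowed."""
    host = urlsplit(database_url).hostname
    if host is None:
        # Fail closed: an unparseable target is treated as unsafe.
        raise RuntimeError("could not determine the database host; refusing to run")
    if host not in LOCAL_HOSTS and not allow_remote:
        raise RuntimeError(
            f"refusing to run a migration against non-local host {host!r}"
        )
    return host


# Hypothetical connection string for a local development database.
checked_host = assert_local_database("postgresql://dev:dev@localhost:5432/app")
```

The guard fails closed, so running against a remote database requires a deliberate `allow_remote=True` rather than a misconfigured connection string. It complements, and does not replace, removing write privileges from user roles on production.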

To our customers, we are deeply sorry that this incident happened and that it prevented you from delivering your mission-critical emails. We know that actions speak louder than words, so we will continue to learn, grow, and improve, starting by implementing the improvements listed above.
