Preset

Write-up
Database Maintenance Overrun
Full outage
RDS Upgrade Failure in Production US East 1
Description

A planned PostgreSQL major version upgrade (14.7 → 17.9) on all RDS instances in production US East 1 extended far beyond its scheduled maintenance window and ultimately failed due to storage space exhaustion, causing approximately 6 hours of downtime for customers on the us2a deployment. AWS automatically rolled back the affected instances to their pre-upgrade state (PostgreSQL 14.15), restoring service at 10:13 EDT. No data loss occurred.

The upgrade had been successfully completed on sandbox, staging, and multiple production regions, where it finished in under 15 minutes. Daily automated snapshots on this same instance consistently took 10–15 minutes, giving no indication that the pre-upgrade snapshot would behave differently. The ~5 hour snapshot duration was entirely unexpected and could not have been reasonably predicted from prior operations.

Timeline (EDT)
  • 04:12 — ModifyDBInstance API call initiated for upgrade (14.7 → 17.9)

  • 04:13 — Pre-upgrade compatibility checks began on the primary database instance

  • 04:18 — API Gateway and Examples RDS instances completed their upgrades successfully

  • 04:51 — Pre-check completed (38 min). Instance shut down, snapshot started

  • 05:28 — Snapshot I/O spike observed — active sequential read at ~45MB/s, write near zero

  • 06:00 — Incident posted on status page

  • 07:26 — Snapshot still in progress (~2.5h elapsed). AWS support ticket opened

  • 08:00 — I/O throughput spiked to ~55MB/s. CPU jumped to ~70%. Appeared to be a transition to a new upgrade phase

  • 08:36 — AWS support replied: upgrade progressing normally, no intervention possible, no ETA available

  • 09:22 — Read I/O dropped to near zero. Write spiked to 130MB/s — appeared to be the instance recovering

  • 09:37 — RDS upgrade officially failed. Recovery process began automatically

  • 10:13 — Systems back online on PostgreSQL 14.15. Post-rollback snapshot running in background
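The phase durations implied by the timeline can be cross-checked with a short script. The timestamps come from the timeline above; treating them as phase boundaries is our interpretation:

```python
from datetime import datetime

# Timestamps from the incident timeline (EDT), treated as phase boundaries.
fmt = "%H:%M"
t = {name: datetime.strptime(ts, fmt) for name, ts in [
    ("precheck_started", "04:13"),
    ("snapshot_started", "04:51"),
    ("upgrade_failed",   "09:37"),
    ("back_online",      "10:13"),
]}

def minutes(a, b):
    """Elapsed minutes between two timeline events."""
    return int((t[b] - t[a]).total_seconds() // 60)

print(minutes("precheck_started", "snapshot_started"))  # pre-check: 38 min
print(minutes("snapshot_started", "upgrade_failed"))    # snapshot phase: 286 min (~4.8 h)
print(minutes("upgrade_failed", "back_online"))         # automatic rollback: 36 min
```

The snapshot phase alone (04:51 to the 09:37 failure) accounts for nearly five of the six hours of downtime.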

Impact
  • All customers on the us2a production deployment were unable to access Superset for approximately 6 hours. Service restored at 10:13 EDT.

  • No data loss or corruption occurred — AWS automatically rolled back the upgrade, recovering the original database engine without restoring from a snapshot.

Cause Analysis

The upgrade failed because the pre-upgrade snapshot consumed all available free storage on the primary database instance.

Storage Exhaustion
  • Volume: 500GB, ~440GB used, ~60GB free at the start of the process

  • The pre-upgrade snapshot ran for approximately 4.5 hours, during which it consumed the remaining ~60GB of free disk space through:

    • Full pre-upgrade snapshot temporary files and metadata

    • pg_upgrade working files and output logs

    • WAL accumulation during the extended snapshot window

  • Once free storage hit zero, the upgrade process failed
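A rough headroom model is consistent with the exhaustion described above. The WAL rate and temp-file overhead below are illustrative assumptions for the sketch, not measured values from the incident:

```python
# Rough model of free-space consumption during the extended snapshot window.
# The per-hour WAL rate and temp-file overhead are illustrative assumptions,
# not figures measured during the incident.
free_gb = 60.0            # ~500 GB volume, ~440 GB used
snapshot_hours = 4.5      # observed pre-upgrade snapshot duration

assumed_wal_gb_per_hour = 8.0    # WAL accumulating while the snapshot runs
assumed_temp_overhead_gb = 25.0  # snapshot temp files + pg_upgrade working files/logs

projected_use = assumed_wal_gb_per_hour * snapshot_hours + assumed_temp_overhead_gb
print(projected_use)            # 61.0 GB of combined overhead
print(projected_use > free_gb)  # True — exceeds the ~60 GB of headroom
```

Even modest per-hour overheads become fatal once the snapshot window stretches from minutes to hours, which is the core of the failure mode.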

Why the Snapshot Took So Long
  • RDS takes a full (non-incremental) snapshot before major version upgrades, regardless of any existing snapshots. This was not apparent from prior experience — daily automated snapshots on this instance are incremental and consistently completed in 10–15 minutes. Upgrades on other regions completed their snapshots in similar timeframes. Nothing in our operational history or AWS documentation clearly indicated the pre-upgrade snapshot would take ~4.5 hours on this instance.

  • S3 snapshot throughput is an AWS-side constraint, capped at ~30–45MB/s regardless of EBS volume provisioned throughput. At 440GB of used data and ~40MB/s, a full snapshot takes ~3 hours minimum — but this throughput ceiling only becomes visible during full (non-incremental) snapshots, which are rare in normal operations.
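The ~3 hour floor quoted above follows directly from the data size and the snapshot throughput ceiling:

```python
used_gb = 440          # data on the volume
throughput_mb_s = 40   # mid-range of the observed ~30-45 MB/s snapshot ceiling

# 1 GB = 1000 MB here, matching the decimal sizing used above.
seconds = used_gb * 1000 / throughput_mb_s
hours = seconds / 3600
print(round(hours, 1))  # ~3.1 hours at sustained best-case throughput
```

Any throughput dips or non-sequential phases push the real duration past this floor, which matches the ~4.5 hours observed.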

Resolution

At 09:37 EDT, the upgrade failed with "Postgres cluster is in a state where pg_upgrade cannot be completed successfully." The pre-upgrade snapshot completed at 10:10 EDT, and the instance restarted at 10:13 EDT on PostgreSQL 14.15. The database engine was restored in a consistent state with no data loss. No manual intervention was required. A post-rollback snapshot was taken automatically and completed at 11:29 EDT.

Future Work

We will rehearse the upgrade process against a full production-scale dataset to validate storage and timing estimates before any future major version upgrade attempt.
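One way to script such a rehearsal is sketched below, assuming boto3; the instance and snapshot identifiers are placeholders, and the client is injected so the logic can be exercised offline:

```python
# Sketch of a rehearsal run: restore the latest production snapshot to a
# throwaway instance, then attempt the major version upgrade against it.
# Assumes boto3; identifiers are placeholders, not real resource names.

def rehearse_upgrade(rds, snapshot_id, target_version,
                     test_instance_id="upgrade-rehearsal"):
    """Restore a snapshot to a scratch instance and kick off the upgrade.

    `rds` is a boto3 RDS client (injected so the call sequence is testable
    without AWS credentials). Returns the ModifyDBInstance parameters.
    """
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=test_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    params = dict(
        DBInstanceIdentifier=test_instance_id,
        EngineVersion=target_version,
        AllowMajorVersionUpgrade=True,  # required for a major version jump
        ApplyImmediately=True,          # don't wait for a maintenance window
    )
    rds.modify_db_instance(**params)
    return params
```

Timing this rehearsal end to end — including the forced full pre-upgrade snapshot — would surface both the storage headroom requirement and the realistic upgrade duration before production is touched.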