Preset

Write-up
Database Maintenance Overrun
Full outage
RDS Upgrade Failure in Production US East 1
Description

A planned PostgreSQL major version upgrade (14.7 → 17.9) on all RDS instances in production US East 1 extended far beyond its scheduled maintenance window and ultimately failed due to storage space exhaustion, causing approximately 6 hours of downtime for customers on the us2a deployment. AWS automatically rolled back the affected instances to their pre-upgrade state (PostgreSQL 14.15), restoring service at 10:13 EDT. No data loss occurred.

The upgrade had been successfully completed on sandbox, staging, and multiple production regions, where it finished in under 15 minutes. Daily automated snapshots on this same instance consistently took 10–15 minutes, giving no indication that the pre-upgrade snapshot would behave differently. The ~5 hour snapshot duration was entirely unexpected and could not have been reasonably predicted from prior operations.

Timeline (EDT)
  • 04:12 — ModifyDBInstance API call initiated for upgrade (14.7 → 17.9)

  • 04:13 — Pre-upgrade compatibility checks began on the primary database instance

  • 04:18 — API Gateway and Examples RDS instances completed their upgrades successfully

  • 04:51 — Pre-check completed (38 min). Instance shut down, snapshot started

  • 05:28 — Snapshot I/O spike observed — active sequential read at ~45MB/s, write near zero

  • 06:00 — Incident posted on status page

  • 07:26 — Snapshot still in progress (~2.5h elapsed). AWS support ticket opened

  • 08:00 — I/O throughput spiked to ~55MB/s. CPU jumped to ~70%. Appeared to be a transition to a new upgrade phase

  • 08:36 — AWS support replied: upgrade progressing normally, no intervention possible, no ETA available

  • 09:22 — Read I/O dropped to near zero. Write spiked to 130MB/s — appeared to be the instance recovering

  • 09:37 — RDS upgrade officially failed. Recovery process began automatically

  • 10:13 — Systems back online on PostgreSQL 14.15. Post-rollback snapshot running in background
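The phase durations implied by the timeline can be cross-checked with a short script. The timestamps come from the timeline above; treating them as phase boundaries is our interpretation:

```python
from datetime import datetime

# Timestamps from the incident timeline (EDT), treated as phase boundaries.
fmt = "%H:%M"
t = {name: datetime.strptime(ts, fmt) for name, ts in [
    ("precheck_started", "04:13"),
    ("snapshot_started", "04:51"),
    ("upgrade_failed",   "09:37"),
    ("back_online",      "10:13"),
]}

def minutes(a, b):
    """Elapsed minutes between two timeline events."""
    return int((t[b] - t[a]).total_seconds() // 60)

print(minutes("precheck_started", "snapshot_started"))  # pre-check: 38 min
print(minutes("snapshot_started", "upgrade_failed"))    # snapshot phase: 286 min (~4.8 h)
print(minutes("upgrade_failed", "back_online"))         # automatic rollback: 36 min
```

The snapshot phase alone (04:51 to the 09:37 failure) accounts for nearly five of the six hours of downtime.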

Impact
  • All customers on the us2a production deployment were unable to access Superset for approximately 6 hours. Service restored at 10:13 EDT.

  • No data loss or corruption occurred — AWS automatically rolled back the upgrade, recovering the original database engine without restoring from a snapshot.

Cause Analysis

The upgrade failed because the pre-upgrade snapshot consumed all available free storage on the primary database instance.

Storage Exhaustion
  • Volume: 500GB, ~440GB used, ~60GB free at the start of the process

  • The pre-upgrade snapshot ran for approximately 4.5 hours, during which it consumed the remaining ~60GB of free disk space through:

    • Full pre-upgrade snapshot temporary files and metadata

    • pg_upgrade working files and output logs

    • WAL accumulation during the extended snapshot window

  • Once free storage hit zero, the upgrade process failed
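A rough headroom model is consistent with the exhaustion described above. The WAL rate and temp-file overhead below are illustrative assumptions for the sketch, not measured values from the incident:

```python
# Rough model of free-space consumption during the extended snapshot window.
# The per-hour WAL rate and temp-file overhead are illustrative assumptions,
# not figures measured during the incident.
free_gb = 60.0            # ~500 GB volume, ~440 GB used
snapshot_hours = 4.5      # observed pre-upgrade snapshot duration

assumed_wal_gb_per_hour = 8.0    # WAL accumulating while the snapshot runs
assumed_temp_overhead_gb = 25.0  # snapshot temp files + pg_upgrade working files/logs

projected_use = assumed_wal_gb_per_hour * snapshot_hours + assumed_temp_overhead_gb
print(projected_use)            # 61.0 GB of combined overhead
print(projected_use > free_gb)  # True — exceeds the ~60 GB of headroom
```

Even modest per-hour overheads become fatal once the snapshot window stretches from minutes to hours, which is the core of the failure mode.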

Why the Snapshot Took So Long
  • RDS takes a full (non-incremental) snapshot before major version upgrades, regardless of any existing snapshots. This was not apparent from prior experience — daily automated snapshots on this instance are incremental and consistently completed in 10–15 minutes. Upgrades on other regions completed their snapshots in similar timeframes. Nothing in our operational history or AWS documentation clearly indicated the pre-upgrade snapshot would take ~4.5 hours on this instance.

  • S3 snapshot throughput is an AWS-side constraint, capped at ~30–45MB/s regardless of EBS volume provisioned throughput. At 440GB of used data and ~40MB/s, a full snapshot takes ~3 hours minimum — but this throughput ceiling only becomes visible during full (non-incremental) snapshots, which are rare in normal operations.
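The ~3 hour floor quoted above follows directly from the data size and the snapshot throughput ceiling:

```python
used_gb = 440          # data on the volume
throughput_mb_s = 40   # mid-range of the observed ~30-45 MB/s snapshot ceiling

# 1 GB = 1000 MB here, matching the decimal sizing used above.
seconds = used_gb * 1000 / throughput_mb_s
hours = seconds / 3600
print(round(hours, 1))  # ~3.1 hours at sustained best-case throughput
```

Any throughput dips or non-sequential phases push the real duration past this floor, which matches the ~4.5 hours observed.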

Resolution

At 09:37 EDT, the upgrade failed with "Postgres cluster is in a state where pg_upgrade cannot be completed successfully." The pre-upgrade snapshot completed at 10:10 EDT, and the instance restarted at 10:13 EDT on PostgreSQL 14.15. The database engine was restored in a consistent state with no data loss. No manual intervention was required. A post-rollback snapshot was taken automatically and completed at 11:29 EDT.

Future Work

We will rehearse the upgrade process against a full production-scale dataset to validate storage and timing estimates before any future major version upgrade attempt.
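One way to script such a rehearsal is sketched below, assuming boto3; the instance and snapshot identifiers are placeholders, and the client is injected so the logic can be exercised offline:

```python
# Sketch of a rehearsal run: restore the latest production snapshot to a
# throwaway instance, then attempt the major version upgrade against it.
# Assumes boto3; identifiers are placeholders, not real resource names.

def rehearse_upgrade(rds, snapshot_id, target_version,
                     test_instance_id="upgrade-rehearsal"):
    """Restore a snapshot to a scratch instance and kick off the upgrade.

    `rds` is a boto3 RDS client (injected so the call sequence is testable
    without AWS credentials). Returns the ModifyDBInstance parameters.
    """
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=test_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    params = dict(
        DBInstanceIdentifier=test_instance_id,
        EngineVersion=target_version,
        AllowMajorVersionUpgrade=True,  # required for a major version jump
        ApplyImmediately=True,          # don't wait for a maintenance window
    )
    rds.modify_db_instance(**params)
    return params
```

Timing this rehearsal end to end — including the forced full pre-upgrade snapshot — would surface both the storage headroom requirement and the realistic upgrade duration before production is touched.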